Streaming input engine facilitating data transfers between application engines and memory

ABSTRACT

A multi-processor includes multiple processing clusters for performing assigned applications. Each cluster includes a set of compute engines, with each compute engine coupled to a set of cache memory. A compute engine includes a central processing unit and a coprocessor with a set of application engines. The central processing unit and coprocessor are coupled to the compute engine's associated cache memory. The sets of cache memory within a cluster are also coupled to one another.

[0001] This application is a continuation of U.S. patent application Ser. No. 09/900,481, entitled “Multi-Processor System,” filed on Jul. 6, 2001, which is incorporated herein by reference.

CROSS-REFERENCE TO RELATED APPLICATIONS

[0002] This Application is related to the following Applications:

[0003] “Coprocessor Including a Media Access Controller,” by Frederick Gruner, Robert Hathaway, Ramesh Panwar, Elango Ganesan and Nazar Zaidi, Attorney Docket No. NEXSI-01021US0, filed the same day as the present application;

[0004] “Application Processing Employing A Coprocessor,” by Frederick Gruner, Robert Hathaway, Ramesh Panwar, Elango Ganesan, and Nazar Zaidi, Attorney Docket No. NEXSI-01201US0, filed the same day as the present application;

[0005] “Compute Engine Employing A Coprocessor,” by Robert Hathaway, Frederick Gruner, and Ricardo Ramirez, Attorney Docket No. NEXSI-001202US0, filed the same day as the present application;

[0006] “Streaming Output Engine Facilitating Data Transfers Between Application Engines And Memory,” by Ricardo Ramirez and Frederick Gruner, Attorney Docket No. NEXSI-01204US0, filed the same day as the present application;

[0007] “Transferring Data Between Cache Memory And A Media Access Controller,” by Frederick Gruner, Robert Hathaway, and Ricardo Ramirez, Attorney Docket No. NEXSI-01211US0, filed the same day as the present application;

[0008] “Processing Packets In Cache Memory,” by Frederick Gruner, Elango Ganesan, Nazar Zaidi, and Ramesh Panwar, Attorney Docket No. NEXSI-01212US0, filed the same day as the present application;

[0009] “Bandwidth Allocation For A Data Path,” by Robert Hathaway, Frederick Gruner, and Mark Bryers, Attorney Docket No. NEXSI-01213US0, filed the same day as the present application;

[0010] “Ring-Based Memory Requests In A Shared Memory Multi-Processor,” by Dave Hass, Frederick Gruner, Nazar Zaidi, Ramesh Panwar, and Mark Vilas, Attorney Docket No. NEXSI-01281US0, filed the same day as the present application;

[0011] “Managing Ownership Of A Full Cache Line Using A Store-Create Operation,” by Dave Hass, Frederick Gruner, Nazar Zaidi, and Ramesh Panwar, Attorney Docket No. NEXSI-01282US0, filed the same day as the present application;

[0012] “Sharing A Second Tier Cache Memory In A Multi-Processor,” by Dave Hass, Frederick Gruner, Nazar Zaidi, and Ramesh Panwar, Attorney Docket No. NEXSI-01283US0, filed the same day as the present application;

[0013] “First Tier Cache Memory Preventing Stale Data Storage,” by Dave Hass, Robert Hathaway, and Frederick Gruner, Attorney Docket No. NEXSI-01284US0, filed the same day as the present application; and

[0014] “Ring Based Multi-Processing System,” by Dave Hass, Mark Vilas, Fred Gruner, Ramesh Panwar, and Nazar Zaidi, Attorney Docket No. NEXSI-01028US0, filed the same day as the present application.

[0015] Each of these related Applications is incorporated herein by reference.

BACKGROUND OF THE INVENTION

[0016] 1. Field of the Invention

[0017] The present invention is directed to processing network packets with multiple processing engines.

[0018] 2. Description of the Related Art

[0019] Multi-processor computer systems include multiple processing engines performing operations at the same time. This is very useful when the computer system constantly receives new time-critical operations to perform.

[0020] For example, networking applications, such as routing, benefit from parallel processing. Routers receive multiple continuous streams of incoming data packets that need to be directed through complex network topologies. Routing determinations require a computer system to process packet data from many sources, as well as learn topological information about the network. Employing multiple processing engines speeds the routing process.

[0021] Another application benefiting from parallel processing is real-time video processing. A computer video system must perform complex compression and decompression operations under stringent time constraints. Employing multiple processors enhances system performance.

[0022] Parallel processing requires: (1) identifying operations to be performed, (2) assigning resources to execute these operations, and (3) executing the operations. Meeting these requirements under time and resource constraints places a heavy burden on a computer system. The system faces the challenges of effectively utilizing processing resources and making data available on demand for processing.

[0023] Overutilizing a system's processors results in long queues of applications waiting to be performed. Networking products employing traditional parallel processing encounter such processor utilization problems. These systems assign each incoming packet to a single processor for all applications. General processors, instead of specialized engines, perform applications requiring complex time-consuming operations. When each processor encounters a packet requiring complex processing, system execution speed drops substantially: processing resources become unavailable to receive new processing assignments or manage existing application queues.

[0024] Memory management also plays an important role in system performance. Many systems include main memory and cache memory, which is faster than main memory and more closely coupled to the system's processors. Systems strive to maintain frequently used data in cache memory to avoid time-consuming accesses to main memory.

[0025] Unfortunately, many applications, such as networking applications, require substantial use of main memory. Networking systems retrieve data packets from a communications network over a communications medium. Traditional systems initially store retrieved data packets in a local buffer, which the system empties into main memory. In order to perform applications using the data packets, the system moves the packets from main memory to cache memory, a time-consuming process.

[0026] Traditional systems also incur costly memory transfer overhead when transmitting data packets. These systems transfer transmit packet data into main memory to await transmission once processor operation on the data is complete, forcing the system to perform yet another main memory transfer to retrieve the data for transmission.

[0027] A need exists for a parallel processing system that effectively utilizes and manages processing and memory resources.

SUMMARY OF THE INVENTION

[0028] A multi-processor in accordance with the present invention efficiently manages processing resources and memory transfers. The multi-processor assigns applications to compute engines that are coupled to cache memory. Each compute engine includes a central processing unit coupled to coprocessor application engines. The application engines are specifically suited for servicing applications assigned to the compute engine. This enables a compute engine to be optimized for servicing the applications it will receive. For example, one compute engine may contain coprocessor application engines for interfacing with a network, while other coprocessors include different application engines.

[0029] The coprocessors also offload the central processing units from processing assigned applications. The coprocessors perform the applications, leaving the central processing units free to manage the allocation of applications. The coprocessors are coupled to the cache memory to facilitate their application processing. Coprocessors exchange data directly with cache memory, avoiding the time-consuming main memory transfers found in conventional computer systems. The multi-processor also couples cache memories from different compute engines, allowing them to exchange data directly without accessing main memory.

[0030] A multi-processor in accordance with the present invention is useful for servicing many different fields of parallel processing applications, such as video processing and networking. One example of a networking application is application based routing. A multi-processor application router in accordance with the present invention includes compute engines for performing the different applications required. For example, application engines enable different compute engines to perform different network services, including but not limited to: 1) virtual private networking; 2) secure sockets layer processing; 3) web caching; 4) hypertext mark-up language compression; and 5) virus checking.

[0031] These and other objects and advantages of the present invention will appear more clearly from the following description in which the preferred embodiment of the invention has been set forth in conjunction with the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0032] FIG. 1 illustrates a multi-processor unit in accordance with the present invention.

[0033] FIG. 2 illustrates a process employed by the multi-processor unit in FIG. 1 to exchange data in accordance with the present invention.

[0034] FIG. 3 shows a processing cluster employed in one embodiment of the multi-processor unit in FIG. 1.

[0035] FIG. 4 shows a processing cluster employed in another embodiment of the multi-processor unit in FIG. 1.

[0036] FIG. 5a illustrates a first tier data cache pipeline in one embodiment of the present invention.

[0037] FIG. 5b illustrates a first tier instruction cache pipeline in one embodiment of the present invention.

[0038] FIG. 6 illustrates a second tier cache pipeline in one embodiment of the present invention.

[0039] FIG. 7 illustrates further details of the second tier pipeline shown in FIG. 6.

[0040] FIG. 8a illustrates a series of operations for processing network packets in one embodiment of the present invention.

[0041] FIG. 8b illustrates a series of operations for processing network packets in an alternate embodiment of the present invention.

[0042] FIGS. 9a-9c show embodiments of a coprocessor for use in a processing cluster in accordance with the present invention.

[0043] FIG. 10 shows an interface between a CPU and the coprocessors in FIGS. 9a-9c.

[0044] FIG. 11 shows an interface between a sequencer and application engines in the coprocessors in FIGS. 9a-9c.

[0045] FIG. 12 shows one embodiment of a streaming input engine for the coprocessors shown in FIGS. 9a-9c.

[0046] FIG. 13 shows one embodiment of a streaming output engine for the coprocessors shown in FIGS. 9a-9c.

[0047] FIG. 14 shows one embodiment of alignment circuitry for use in the streaming output engine shown in FIG. 13.

[0048] FIG. 15 shows one embodiment of a reception media access controller engine in the coprocessor shown in FIG. 9c.

[0049] FIG. 16 illustrates a packet reception process in accordance with the present invention.

[0050] FIG. 17 shows a logical representation of a data management scheme for received data packets in one embodiment of the present invention.

[0051] FIG. 18 shows one embodiment of a transmission media access controller engine in the coprocessor shown in FIG. 9c.

[0052] FIG. 19 illustrates a packet transmission process in accordance with one embodiment of the present invention.

[0053] FIG. 20 illustrates a packet transmission process in accordance with an alternate embodiment of the present invention.

DETAILED DESCRIPTION

[0054] A. Multi-Processing Unit

[0055] FIG. 1 illustrates a multi-processor unit (MPU) in accordance with the present invention. MPU 10 includes processing clusters 12, 14, 16, and 18, which perform application processing for MPU 10. Each processing cluster 12, 14, 16, and 18 includes at least one compute engine (not shown) coupled to a set of cache memory (not shown). The compute engine processes applications, and the cache memory maintains data locally for use during those applications. MPU 10 assigns applications to each processing cluster and makes the necessary data available in the associated cache memory.

[0056] MPU 10 overcomes drawbacks of traditional multi-processor systems. MPU 10 assigns tasks to clusters based on the applications they perform. This allows MPU 10 to utilize engines specifically designed to perform their assigned tasks. MPU 10 also reduces time-consuming accesses to main memory 26 by passing cache data between clusters 12, 14, 16, and 18. The local proximity of the data, as well as the application specialization, expedites processing.

[0057] Global snoop controller 22 manages data sharing between clusters 12, 14, 16, and 18 and main memory 26. Clusters 12, 14, 16, and 18 are each coupled to provide memory requests to global snoop controller 22 via point-to-point connections. Global snoop controller 22 issues snoop instructions to clusters 12, 14, 16, and 18 on a snoop ring.

[0058] In one embodiment, as shown in FIG. 1, clusters 12, 14, 16, and 18 are coupled to global snoop controller 22 via point-to-point connections 13, 15, 17, and 19, respectively. A snoop ring includes coupling segments 21₁₋₄, which will be collectively referred to as snoop ring 21. Segment 21₁ couples global snoop controller 22 to cluster 18. Segment 21₂ couples cluster 18 to cluster 12. Segment 21₃ couples cluster 12 to cluster 14. Segment 21₄ couples cluster 14 to cluster 16. The interaction between global snoop controller 22 and clusters 12, 14, 16, and 18 will be described below in greater detail.

[0059] Global snoop controller 22 initiates accesses to main memory 26 through external bus logic (EBL) 24, which couples snoop controller 22 and clusters 12, 14, 16, and 18 to main memory 26. EBL 24 transfers data between main memory 26 and clusters 12, 14, 16, and 18 at the direction of global snoop controller 22. EBL 24 is coupled to receive memory transfer instructions from global snoop controller 22 over point-to-point link 11.

[0060] EBL 24 and processing clusters 12, 14, 16, and 18 exchange data with each other over a logical data ring. In one embodiment of the invention, MPU 10 implements the data ring through a set of point-to-point connections. The data ring is schematically represented in FIG. 1 as coupling segments 20₁₋₅ and will be referred to as data ring 20. Segment 20₁ couples cluster 18 to cluster 12. Segment 20₂ couples cluster 12 to cluster 14. Segment 20₃ couples cluster 14 to cluster 16. Segment 20₄ couples cluster 16 to EBL 24, and segment 20₅ couples EBL 24 to cluster 18. Further details regarding the operation of data ring 20 and EBL 24 appear below.

[0061] FIG. 2 illustrates a process employed by MPU 10 to transfer data and memory location ownership in one embodiment of the present invention. For purposes of illustration, FIG. 2 demonstrates the process with cluster 12; the same process is applicable to clusters 14, 16, and 18.

[0062] Processing cluster 12 determines whether a memory location for an application operation is mapped into the cache memory in cluster 12 (step 30). If cluster 12 has the location, then cluster 12 performs the operation (step 32). Otherwise, cluster 12 issues a request for the necessary memory location to global snoop controller 22 (step 34). In one embodiment, cluster 12 issues the request via point-to-point connection 13. As part of the request, cluster 12 forwards a request descriptor that instructs snoop controller 22 and aids in tracking a response to the request.

[0063] Global snoop controller 22 responds to the memory request by issuing a snoop request to clusters 14, 16, and 18 (step 36). The snoop request instructs each cluster to transfer either ownership of the requested memory location or the location's content to cluster 12. Clusters 14, 16, and 18 each respond to the snoop request by performing the requested action or indicating it does not possess the requested location (step 37). In one embodiment, global snoop controller 22 issues the request via snoop ring 21, and clusters 14, 16, and 18 perform requested ownership and data transfers via snoop ring 21. In addition to responding on snoop ring 21, clusters acknowledge servicing the snoop request through their point-to-point links with snoop controller 22. Snoop request processing will be explained in greater detail below.

[0064] If one of the snooped clusters possesses the requested memory, the snooped cluster forwards the memory to cluster 12 using data ring 20 (step 37). In one embodiment, no data is transferred, but the requested memory location's ownership is transferred to cluster 12. Data and memory location transfers between clusters will be explained in greater detail below.

[0065] Global snoop controller 22 analyzes the clusters' snoop responses to determine whether the snooped clusters owned and transferred the desired memory (step 38). If cluster 12 obtained access to the requested memory location in response to the snoop request, cluster 12 performs the application operations (step 32). Otherwise, global snoop controller 22 instructs EBL 24 to carry out an access to main memory 26 (step 40). EBL 24 transfers data between cluster 12 and main memory 26 on data ring 20. Cluster 12 performs the application operation once the main memory access is completed (step 32).
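As a rough illustration only, the request flow of FIG. 2 can be modeled in a few lines of Python. This is a software sketch, not the circuitry described: the `Cluster` class, the `access` function, and the dictionary-based caches are hypothetical names introduced here, and the sketch models only the case in which a snooped cluster hands the location (and its ownership) over to the requesting cluster.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for the cache contents of one processing cluster."""
    name: str
    cache: dict = field(default_factory=dict)  # address -> data


def access(requester, others, main_memory, address):
    """Return (data, source), loosely following steps 30-40 of FIG. 2."""
    # Step 30: is the location already mapped into the requester's cache?
    if address in requester.cache:
        return requester.cache[address], "local cache"  # step 32
    # Steps 34-37: the request goes to the snoop controller, which snoops
    # the other clusters; an owner forwards the location to the requester.
    for cluster in others:
        if address in cluster.cache:
            requester.cache[address] = cluster.cache.pop(address)
            return requester.cache[address], f"cluster {cluster.name}"
    # Steps 38-40: no cluster owned the location, so main memory is accessed.
    requester.cache[address] = main_memory[address]
    return requester.cache[address], "main memory"
```

For example, with cluster 14 holding address `0x40`, the first `access` by cluster 12 is served by cluster 14, a second access to the same address hits cluster 12's own cache, and an address held by no cluster falls through to main memory.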

[0066] B. Processing Cluster

[0067] In one embodiment of the present invention, a processing cluster includes a single compute engine for performing applications. In alternate embodiments, a processing cluster employs multiple compute engines. A processing cluster in one embodiment of the present invention also includes a set of cache memory for expediting application processing. Embodiments including these features are described below.

[0068] 1. Processing Cluster—Single Compute Engine

[0069] FIG. 3 shows one embodiment of a processing cluster in accordance with the present invention. For purposes of illustration, FIG. 3 shows processing cluster 12. In some embodiments of the present invention, the circuitry shown in FIG. 3 is also employed in clusters 14, 16, and 18.

[0070] Cluster 12 includes compute engine 50 coupled to first tier data cache 52, first tier instruction cache 54, second tier cache 56, and memory management unit (MMU) 58. Both instruction cache 54 and data cache 52 are coupled to second tier cache 56, which is coupled to snoop controller 22, snoop ring 21, and data ring 20. Compute engine 50 manages a queue of application requests, each requiring an application to be performed on a set of data.

[0071] When compute engine 50 requires access to a block of memory, compute engine 50 converts a virtual address for the block of memory into a physical address. In one embodiment of the present invention, compute engine 50 internally maintains a limited translation buffer (not shown). The internal translation buffer performs conversions within compute engine 50 for a limited number of virtual memory addresses.

[0072] Compute engine 50 employs MMU 58 for virtual memory address conversions not supported by the internal translation buffer. In one embodiment, compute engine 50 has separate conversion request interfaces coupled to MMU 58 for data accesses and instruction accesses. As shown in FIG. 3, compute engine 50 employs request interfaces 70 and 72 for data accesses and request interface 68 for instruction access.

[0073] In response to a conversion request, MMU 58 provides either a physical address and memory block size or a failed access response. The failed access responses include: 1) no corresponding physical address exists; 2) only read access is allowed and compute engine 50 is attempting to write; or 3) access is denied.

[0074] After obtaining a physical address, compute engine 50 provides the address to either data cache 52 or instruction cache 54: data accesses go to data cache 52, and instruction accesses go to instruction cache 54. In one embodiment, first tier caches 52 and 54 are 4K direct-mapped caches, with data cache 52 being write-through to second tier cache 56. In an alternate embodiment, caches 52 and 54 are 8K 2-way set associative caches.

[0075] A first tier cache (52 or 54) addressed by compute engine 50 determines whether the addressed location resides in the addressed first tier cache. If so, the cache allows compute engine 50 to perform the requested memory access. Otherwise, the first tier cache forwards the memory access of compute engine 50 to second tier cache 56. In one embodiment, second tier cache 56 is a 64K 4-way set associative cache.

[0076] Second tier cache 56 makes the same determination as the first tier cache. If second tier cache 56 contains the requested memory location, compute engine 50 exchanges information with second tier cache 56 through first tier cache 52 or 54. Instructions are exchanged through instruction cache 54, and data is exchanged through data cache 52. Otherwise, second tier cache 56 places a memory request to global snoop controller 22, which performs a memory retrieval process. In one embodiment, the memory retrieval process is the process described above with reference to FIG. 2. Greater detail and embodiments addressing memory transfers will be described below.

[0077] Cache 56 communicates with snoop controller 22 via point-to-point link 13 and snoop ring interfaces 21₂ and 21₃, as described in FIG. 1. Cache 56 uses link 13 to request memory accesses outside cluster 12. Second tier cache 56 receives and forwards snoop requests on snoop ring interfaces 21₂ and 21₃. Cache 56 uses data ring interface segments 20₁ and 20₂ for exchanging data on data ring 20, as described above with reference to FIGS. 1 and 2.

[0078] In one embodiment, compute engine 50 contains CPU 60 coupled to coprocessor 62. CPU 60 is coupled to MMU 58, data cache 52, and instruction cache 54. Instruction cache 54 and data cache 52 couple CPU 60 to second tier cache 56. Coprocessor 62 is coupled to data cache 52 and MMU 58. First tier data cache 52 couples coprocessor 62 to second tier cache 56.

[0079] Coprocessor 62 helps MPU 10 overcome processor utilization drawbacks associated with traditional multi-processing systems. Coprocessor 62 includes application specific processing engines designed to execute applications assigned to compute engine 50. This allows CPU 60 to offload application processing to coprocessor 62, so CPU 60 can effectively manage the queue of assigned applications.

[0080] In operation, CPU 60 instructs coprocessor 62 to perform an application from the application queue. Coprocessor 62 uses its interfaces to MMU 58 and data cache 52 to obtain access to the memory necessary for performing the application. Both CPU 60 and coprocessor 62 perform memory accesses as described above for compute engine 50, except that coprocessor 62 doesn't perform instruction fetches.

[0081] In one embodiment, CPU 60 and coprocessor 62 each include limited internal translation buffers for converting virtual memory addresses to physical addresses. In one such embodiment, CPU 60 includes 2 translation buffer entries for instruction accesses and 3 translation buffer entries for data accesses. In one embodiment, coprocessor 62 includes 4 translation buffer entries.

[0082] Coprocessor 62 informs CPU 60 once an application is complete. CPU 60 then removes the application from its queue and instructs a new compute engine to perform the next application. Greater detail on application management will be provided below.

[0083] 2. Processing Cluster—Multiple Compute Engines

[0084] FIG. 4 illustrates an alternate embodiment of processing cluster 12 in accordance with the present invention. In FIG. 4, cluster 12 includes multiple compute engines operating the same as above-described compute engine 50. Cluster 12 includes compute engine 50 coupled to data cache 52, instruction cache 54, and MMU 82. Compute engine 50 includes CPU 60 and coprocessor 62 having the same coupling and operation described above in FIG. 3. In fact, all elements appearing in FIG. 4 with the same numbering as in FIG. 3 have the same operation as described in FIG. 3.

[0085] MMU 82 and MMU 84 operate the same as MMU 58 in FIG. 3, except MMU 82 and MMU 84 each support two compute engines. In an alternate embodiment, cluster 12 includes 4 MMUs, each coupled to a single compute engine. Second tier cache 80 operates the same as second tier cache 56 in FIG. 3, except second tier cache 80 is coupled to and supports data caches 52, 92, 96, and 100 and instruction caches 54, 94, 98, and 102. Data caches 52, 92, 96, and 100 in FIG. 4 operate the same as data cache 52 in FIG. 3, and instruction caches 54, 94, 98, and 102 operate the same as instruction cache 54 in FIG. 3. Compute engines 50, 86, 88, and 90 operate the same as compute engine 50 in FIG. 3.

[0086] Each compute engine (50, 86, 88, and 90) also includes a CPU (60, 116, 120, and 124, respectively) and a coprocessor (62, 118, 122, and 126, respectively) coupled and operating the same as described for CPU 60 and coprocessor 62 in FIG. 3. Each CPU (60, 116, 120, and 124) is coupled to a data cache (52, 92, 96, and 100, respectively), instruction cache (54, 94, 98, and 102, respectively), and MMU (82 and 84). Each coprocessor (62, 118, 122, and 126, respectively) is coupled to a data cache (52, 92, 96, and 100, respectively) and MMU (82 and 84). Each CPU (60, 116, 120, and 124) communicates with the MMU (82 and 84) via separate conversion request interfaces for data (70, 106, 110, and 114, respectively) and instructions (68, 104, 108, and 112, respectively) accesses. Each coprocessor (62, 118, 122, and 126) communicates with the MMU (82 and 84) via a conversion request interface (72, 73, 74, and 75) for data accesses.

[0087] In one embodiment, each coprocessor (62, 118, 122, and 126) includes four internal translation buffers, and each CPU (60, 116, 120, and 124) includes 5 internal translation buffers, as described above with reference to FIG. 3. In one such embodiment, translation buffers in coprocessors coupled to a common MMU contain the same address conversions.

[0088] In supporting two compute engines, MMU 82 and MMU 84 each provide arbitration logic to choose between requesting compute engines. In one embodiment, MMU 82 and MMU 84 each arbitrate by servicing competing compute engines on an alternating basis when competing address translation requests are made. For example, in such an embodiment, MMU 82 first services a request from compute engine 50 and then services a request from compute engine 86, when simultaneous translation requests are pending.
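The alternating arbitration described for MMU 82 and MMU 84 can be sketched as a small software model. The class name `AlternatingArbiter`, the requester indices 0 and 1 (standing for the two compute engines an MMU supports), and the choice that requester 0 wins the very first tie are assumptions made for illustration; the description does not specify them.

```python
class AlternatingArbiter:
    """Two-requester arbiter that alternates grants under contention."""

    def __init__(self):
        # Assumption: start so that requester 0 wins the first tie.
        self.last_granted = 1

    def grant(self, req0, req1):
        """Return the index of the serviced requester, or None when idle."""
        if req0 and req1:
            # Competing requests: service whoever was not serviced last.
            winner = 1 - self.last_granted
        elif req0:
            winner = 0
        elif req1:
            winner = 1
        else:
            return None
        self.last_granted = winner
        return winner
```

When both requesters keep their requests pending, successive calls to `grant(True, True)` alternate between the two, matching the example in which MMU 82 services compute engine 50 and then compute engine 86.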

[0089] 3. Processing Cluster Memory Management

[0090] The following describes a memory management system for MPU 10 in one embodiment of the present invention. In this embodiment, MPU 10 includes the circuitry described above with reference to FIG. 4.

[0091] a. Data Ring

[0092] Data ring 20 facilitates the exchange of data and instructions between clusters 12, 14, 16, and 18 and EBL 24. Data ring 20 carries packets with both header information and a payload. The payload contains either data or instructions from a requested memory location. In operation, either a cluster or EBL 24 places a packet on a segment of data ring 20. For example, cluster 18 drives data ring segment 20₁ into cluster 12. The header information identifies the intended target for the packet. The EBL and each cluster pass the packet along data ring 20 until the packet reaches the intended target. When a packet reaches the intended target (EBL 24 or cluster 12, 14, 16, or 18) the packet is not transferred again.

[0093] In one embodiment of the present invention, data ring 20 includes the following header signals: 1) Validity, indicating whether the information on data ring 20 is valid; 2) Cluster, identifying the cluster that issues the memory request leading to the data ring transfer; 3) Memory Request, identifying the memory request leading to the data ring transfer; 4) MESI, providing ownership status; and 5) Transfer Done, indicating whether the data ring transfer is the last in a connected series of transfers. In addition to the header, data ring 20 includes a payload. In one embodiment, the payload carries 32 bytes. In alternate embodiments of the present invention, different fields can be employed on the data ring.

[0094] In some instances, a cluster needs to transfer more bytes than a single payload field can store. For example, second tier cache 80 typically transfers an entire 64 byte cache line. A transfer of this size is made using two transfers on data ring 20, each carrying a 32 byte payload. By using the header information, multiple data ring payload transfers can be concatenated to create a single payload in excess of 32 bytes. In the first transfer, the Transfer Done field is set to indicate the transfer is not done. In the second transfer, the Transfer Done field indicates the transfer is done.
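The header fields and the two-transfer handling of a 64 byte cache line can be illustrated with a minimal sketch. The `RingTransfer` record, the `split_line` and `reassemble` helpers, and the field encodings (a boolean for Validity and Transfer Done, integers for Cluster and Memory Request, a one-letter string for MESI) are assumptions for illustration; the description names the fields but not their widths or encodings.

```python
from dataclasses import dataclass

PAYLOAD_BYTES = 32  # payload size in the embodiment described


@dataclass
class RingTransfer:
    valid: bool           # Validity
    cluster: int          # cluster that issued the memory request
    memory_request: int   # identifies the originating memory request
    mesi: str             # ownership status: "M", "E", "S", or "I"
    transfer_done: bool   # True only on the last transfer of a series
    payload: bytes


def split_line(line, cluster, request_id, mesi):
    """Split a cache line into a connected series of ring transfers."""
    chunks = [line[i:i + PAYLOAD_BYTES]
              for i in range(0, len(line), PAYLOAD_BYTES)]
    last = len(chunks) - 1
    return [RingTransfer(True, cluster, request_id, mesi, i == last, chunk)
            for i, chunk in enumerate(chunks)]


def reassemble(transfers):
    """Concatenate payloads until a transfer marked Transfer Done arrives."""
    data = b""
    for transfer in transfers:
        if not transfer.valid:
            continue  # ignore ring slots that carry no valid information
        data += transfer.payload
        if transfer.transfer_done:
            return data
    raise ValueError("series ended without Transfer Done")
```

Splitting a 64 byte line yields two transfers whose Transfer Done flags are not-done and then done, and `reassemble` recovers the original line from them.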

[0095] The MESI field provides status about the ownership of the memory location containing the payload. A device initiating a data ring transfer sets the MESI field, along with the other header information. The MESI field has the following four states: 1) Modified; 2) Exclusive; 3) Shared; and 4) Invalid. A device sets the MESI field to Exclusive if the device possesses sole ownership of the payload data prior to transfer on data ring 20. A device sets the MESI field to Modified if the device modifies the payload data prior to transfer on data ring 20 (only an Exclusive or Modified owner can modify data). A device sets the MESI field to Shared if the data being transferred onto data ring 20 currently has a Shared or Exclusive setting in the MESI field and another entity requests ownership of the data. A device sets the MESI field to Invalid if the data to be transferred on data ring 20 is invalid. Examples of MESI field setting will be provided below.

[0096] b. First Tier Cache Memory

[0097] FIG. 5a illustrates a pipeline of operations performed by first tier data caches 52, 92, 96, and 100 in one embodiment of the present invention. For ease of reference, FIG. 5a is explained with reference to data cache 52, although the implementation shown in FIG. 5a is applicable to all first tier data caches.

[0098] In stage 360, cache 52 determines whether to select a memory access request from CPU 60, coprocessor 62, or second tier cache 80. In one embodiment, cache 52 gives cache 80 the highest priority and toggles between selecting the CPU and coprocessor. As will be explained below, second tier cache 80 accesses first tier cache 52 to provide fill data when cache 52 has a miss.
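The stage 360 selection policy can be sketched as follows. The class name `Stage360Selector`, the string labels, and in particular the choice to flip the toggle only when the CPU and coprocessor contend in the same cycle are assumptions for illustration; the description states only that cache 80 has the highest priority and that selection toggles between the CPU and coprocessor.

```python
class Stage360Selector:
    """Picks one request source per cycle for the first tier data cache."""

    def __init__(self):
        # Assumption: the CPU is favored on the first contended cycle.
        self.prefer_cpu = True

    def select(self, cpu, coprocessor, second_tier):
        """Each argument is True when that source has a pending request."""
        if second_tier:
            return "second_tier"  # fill data from cache 80 always wins
        if cpu and coprocessor:
            winner = "cpu" if self.prefer_cpu else "coprocessor"
            self.prefer_cpu = not self.prefer_cpu  # toggle for next tie
            return winner
        if cpu:
            return "cpu"
        if coprocessor:
            return "coprocessor"
        return None
```

Unlike a plain alternating arbiter, a pending second tier request preempts both compute engine sources without disturbing whose turn comes next.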

[0099] In stage 362, cache 52 determines whether cache 52 contains the memory location for the requested access. In one embodiment, cache 52 performs a tag lookup using bits from the memory address of the CPU, coprocessor, or second tier cache. If cache 52 detects a memory location match, the cache's data array is also accessed in stage 362 and the requested operation is performed.
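To make the tag lookup concrete, the sketch below splits an address into tag, index, and offset for a 4K direct-mapped cache with 64 byte lines. Both figures appear in this description for first tier caches in particular embodiments, but combining them in one cache is an assumption, as are the names `split_address` and `DirectMappedTags`. With these numbers the cache holds 4096 / 64 = 64 lines, so the offset and the index each take 6 address bits.

```python
CACHE_BYTES = 4 * 1024
LINE_BYTES = 64
NUM_LINES = CACHE_BYTES // LINE_BYTES       # 64 lines

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 6 bits select a byte in a line
INDEX_BITS = NUM_LINES.bit_length() - 1     # 6 bits select a line


def split_address(addr):
    """Return (tag, index, offset) for a physical address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class DirectMappedTags:
    """Tag array only; a hit means the stored tag equals the address tag."""

    def __init__(self):
        self.tags = [None] * NUM_LINES

    def lookup(self, addr):
        tag, index, _ = split_address(addr)
        return self.tags[index] == tag

    def fill(self, addr):
        tag, index, _ = split_address(addr)
        self.tags[index] = tag
```

Two addresses that differ by exactly 4096 map to the same index with different tags, so in a direct-mapped cache they cannot both be resident at once.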

[0100] In the case of a load operation from compute engine 50, cache 52supplies the requested data from the cache's data array to computeengine 50. In the case of a store operation, cache 52 stores datasupplied by compute engine 50 in the cache's data array at the specifiedmemory location. In one embodiment of the present invention, cache 52 isa write-through cache that transfers all stores through to second tiercache 80. The store operation only writes data into cache 52 after amemory location match—cache 52 is not filled after a miss. In one suchembodiment, cache 52 is relieved of maintaining cache line ownership.

[0101] In one embodiment of the present invention, cache 52 implementsstores using a read-modify-write protocol. In such an embodiment, cache52 responds to store operations by loading the entire data array cacheline corresponding to the addressed location into store buffer 367.Cache 52 modifies the data in store buffer 367 with data from the storeinstruction issued by compute engine 50. Cache 52 then stores themodified cache line in the data array when cache 52 has a free cycle. Ifa free cycle doesn't occur before the next write to store buffer 367,cache 52 executes the store without using a free cycle.

[0102] In an alternate embodiment, the store buffer is smaller than an entire cache line, so cache 52 only loads a portion of the cache line into the store buffer. For example, in one embodiment cache 52 has a 64 byte cache line and a 16 byte store buffer. In load operations, data bypasses store buffer 367.

[0103] Cache 52 also provides parity generation and checking. When cache 52 writes the data array, a selection is made in stage 360 between using store buffer data (SB Data) and second tier cache fill data (ST Data). Cache 52 also performs parity generation on the selected data in stage 360 and writes the data array in stage 362. Cache 52 also parity checks data supplied from the data array in stage 362.

[0104] If cache 52 does not detect an address match in stage 362, then cache 52 issues a memory request to second tier cache 80. Cache 52 also issues a memory request to cache 80 if cache 52 recognizes a memory operation as non-cacheable.

[0105] Other memory related operations issued by compute engine 50 include pre-fetch and store-create. A pre-fetch operation calls for cache 52 to ensure that an identified cache line is mapped into the data array of cache 52. Cache 52 handles a pre-fetch the same as a load of a full cache line, except that no data is returned to compute engine 50. If cache 52 detects an address match in stage 362 for a pre-fetch operation, no further processing is required. If an address miss is detected, cache 52 forwards the pre-fetch request to cache 80. Cache 52 loads any data returned by cache 80 into the cache 52 data array.

[0106] A store-create operation calls for cache 52 to ensure that cache 52 is the sole owner of an identified cache line, without regard for whether the cache line contains valid data. In one embodiment, a predetermined pattern of data is written into the entire cache line. The predetermined pattern is repeated throughout the entire cache line. Compute engine 50 issues a store-create command as part of a store operand for storing data into an entire cache line. All store-create requests are forwarded to cache 80, regardless of whether an address match occurs.

[0107] In one embodiment, cache 52 issues memory requests to cache 80 over a point-to-point link, as shown in FIGS. 3 and 4. This link allows cache 80 to receive the request and associated data and respond accordingly with data and control information. In one such embodiment, cache 52 provides cache 80 with a memory request that includes the following fields: 1) Validity: indicating whether the request is valid; 2) Address: identifying the memory location requested; and 3) Opcode: identifying the memory access operation requested.

[0108] After receiving the memory request, cache 80 generates the following additional fields: 4) Dependency: identifying memory access operations that must be performed before the requested memory access; 5) Age: indicating the time period the memory request has been pending; and 6) Sleep: indicating whether the memory request has been placed in sleep mode, preventing the memory request from being reissued. Sleep mode will be explained in further detail below. Cache 80 sets the Dependency field in response to the Opcode field, which identifies existing dependencies.

[0109] In one embodiment of the present invention, cache 52 includes fill buffer 366 and replay buffer 368. Fill buffer 366 maintains a list of memory locations from requests transferred to cache 80. The listed locations correspond to requests calling for loads. Cache 52 employs fill buffer 366 to match incoming fill data from second tier cache 80 with corresponding load commands. The corresponding load command informs cache 52 whether the incoming data is a cacheable load for storage in the cache 52 data array or a non-cacheable load for direct transfer to compute engine 50.

[0110] As an additional benefit, fill buffer 366 enables cache 52 to avoid data corruption from an overlapping load and store to the same memory location. If compute engine 50 issues a store to a memory location listed in fill buffer 366, cache 52 will not write data returned by cache 80 for the memory location to the data array. Cache 52 removes a memory location from fill buffer 366 after cache 80 services the associated load. In one embodiment, fill buffer 366 contains 5 entries.
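
The overlap protection provided by the fill buffer can be illustrated with the following sketch. It assumes one entry per outstanding load address and a per-entry flag that a later store clears; the class and method names are hypothetical and are not taken from the embodiment.

```python
class FillBuffer:
    """Illustrative model of a fill buffer tracking outstanding loads."""

    def __init__(self, capacity=5):
        self.capacity = capacity  # 5 entries in one described embodiment
        self.entries = {}         # address -> True while the fill may be written

    def record_load(self, addr):
        # Called when a load miss is forwarded to the second tier cache.
        if len(self.entries) >= self.capacity:
            return False
        self.entries[addr] = True
        return True

    def note_store(self, addr):
        # A store to a listed location blocks the pending fill from being written.
        if addr in self.entries:
            self.entries[addr] = False

    def fill_arrived(self, addr):
        # Returns True if the returned data may be written to the data array.
        # The entry is removed once the load has been serviced.
        return self.entries.pop(addr, False)
```

A fill that arrives after an intervening store to the same location is therefore kept out of the data array, while unrelated fills proceed normally.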

[0111] Replay buffer 368 assists cache 52 in transferring data from cache 80 to compute engine 50. Replay buffer 368 maintains a list of load requests forwarded to cache 80. Cache 80 responds to a load request by providing an entire cache line (up to 64 bytes in one embodiment). When a load request is listed in replay buffer 368, cache 52 extracts the requested load memory out of the returned cache line for compute engine 50. This relieves cache 52 from retrieving the desired memory from the data array after a fill completes.

[0112] Cache 52 also uses replay buffer 368 to perform any operations necessary before transferring the extracted data back to compute engine 50. For example, cache 80 returns an entire cache line of data, but in some instances compute engine 50 only requests a portion of the cache line. Replay buffer 368 alerts cache 52, so cache 52 can realign the extracted data to appear in the data path byte positions desired by compute engine 50. The desired data operations, such as realignments and rotations, are stored in replay buffer 368 along with their corresponding requests.
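
As a rough illustration of the extraction and realignment step, the sketch below pulls the requested bytes out of a returned cache line and places them in the low-order byte positions of an 8 byte data path. The data path width, the right alignment, and the zero padding are assumptions for the example; the text does not specify them.

```python
def extract_for_load(line, offset, size, path_width=8):
    """Extract `size` bytes at `offset` from a returned cache line and
    right-align them on a data path `path_width` bytes wide (assumed layout)."""
    if size > path_width or offset + size > len(line):
        raise ValueError("request does not fit the line or the data path")
    wanted = line[offset:offset + size]
    # Pad the unused high-order byte positions with zeros.
    return bytes(path_width - size) + wanted
```

For a 64 byte line whose bytes hold their own index, a 2 byte load at offset 10 yields six zero bytes followed by `0x0a 0x0b` in this model.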

[0113] FIG. 5b shows a pipeline of operations for first tier instruction caches 54, 94, 98, and 102 in one embodiment of the present invention. The pipeline shown in FIG. 5b is similar to the pipeline shown in FIG. 5a, with the following exceptions. A coprocessor does not access a first tier instruction cache, so the cache only needs to select between a CPU and second tier cache in stage 360. A CPU does not write to an instruction cache, so only second tier data (ST Data) is written into the cache's data array in stage 362. An instruction cache does not include a fill buffer, replay buffer, or store buffer.

[0114] c. Second Tier Cache Memory

[0115] FIG. 6 illustrates a pipeline of operations implemented by second tier cache 80 in one embodiment of the present invention. In stage 370, cache 80 accepts memory requests. In one embodiment, cache 80 is coupled to receive memory requests from external sources (Fill), global snoop controller 22 (Snoop), first tier data caches 52, 92, 96, and 100 (FTD-52; FTD-92; FTD-96; FTD-100), and first tier instruction caches 54, 94, 98, and 102 (FTI-54; FTI-94; FTI-98; FTI-102). In one embodiment, external sources include external bus logic 24 and other clusters seeking to drive data on data ring 20.

[0116] As shown in stage 370, cache 80 includes memory request queues 382, 384, 386, and 388 for receiving and maintaining memory requests from data caches 52, 92, 96, and 100, respectively. In one embodiment, memory request queues 382, 384, 386, and 388 each hold up to 8 memory requests. Each queue entry contains the above-described memory request descriptor, including the Validity, Address, Opcode, Dependency, Age, and Sleep fields. If a first tier data cache attempts to make a request when its associated request queue is full, cache 80 signals the first tier cache that the request cannot be accepted. In one embodiment, the first tier cache responds by submitting the request later. In an alternate embodiment, the first tier cache kills the requested memory operation.

[0117] Cache 80 also includes snoop queue 390 for receiving and maintaining requests from snoop ring 21. Upon receiving a snoop request, cache 80 buffers the request in queue 390 and forwards the request to the next cluster on snoop ring 21. In one embodiment of the present invention, global snoop controller 22 issues the following types of snoop requests: 1) Own: instructing a cluster to transfer exclusive ownership of a memory location and transfer its contents to another cluster after performing any necessary coherency updates; 2) Share: instructing a cluster to transfer shared ownership of a memory location and transfer its contents to another cluster after performing any necessary coherency updates; and 3) Kill: instructing a cluster to release ownership of a memory location without performing any data transfers or coherency updates.

[0118] In one such embodiment, snoop requests include descriptors with the following fields: 1) Validity: indicating whether the snoop request is valid; 2) Cluster: identifying the cluster that issued the memory request leading to the snoop request; 3) Memory Request: identifying the memory request leading to the snoop request; 4) ID: an identifier global snoop controller 22 assigns to the snoop request; 5) Address: identifying the memory location requested; and 6) Opcode: identifying the type of snoop request.

[0119] Although not shown, cache 80 includes receive data buffers, in addition to the request queues shown in stage 370. The receive data buffers hold data passed from cache 52 for use in requested memory operations, such as stores. In one embodiment, cache 80 does not contain the receive data buffers for data received from data ring 20 along with Fill requests, since Fill requests are serviced with the highest priority. Cache 80 includes a scheduler for assigning priority to the above-described memory requests. In stage 370, the scheduler begins the prioritization process by selecting requests that originate from snoop queue 390 and each of compute engines 50, 86, 88, and 90, if any exist. For snoop request queue 390, the scheduler selects the first request with a Validity field showing the request is valid. In one embodiment, the scheduler also selects an entry before it remains in queue 390 for a predetermined period of time.

[0120] For each compute engine, the scheduler gives first tier instruction cache requests (FTI) priority over first tier data cache requests (FTD). In each data cache request queue (382, 384, 386, and 388), the scheduler assigns priority to memory requests based on predetermined criteria. In one embodiment, the predetermined criteria are programmable. A user can elect to have cache 80 assign priority based on a request's Opcode field or the age of the request. The scheduler employs the above-described descriptors to make these priority determinations.

[0121] For purposes of illustration, the scheduler's programmable prioritization is described with reference to queue 382. The same prioritization process is performed for queues 384, 386, and 388. In one embodiment, priority is given to load requests. The scheduler in cache 80 reviews the Opcode fields of the request descriptors in queue 382 to identify all load operations. In an alternate embodiment, store operations are favored. The scheduler also identifies these operations by employing the Opcode field.

[0122] In yet another embodiment, cache 80 gives priority to the oldest requests in queue 382. The scheduler in cache 80 accesses the Age field in the request descriptors in queue 382 to determine the oldest memory request. Alternative embodiments also provide for giving priority to the newest request. In some embodiments of the present invention, prioritization criteria are combined. For example, cache 80 gives priority to load operations and a higher priority to older load operations. Those of ordinary skill in the art recognize that many priority criteria combinations are possible.
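
The programmable prioritization within a single request queue can be sketched as a sort key over the descriptor fields. The dictionary layout, the parameter names, and the decision to skip sleeping requests at this point are assumptions made for illustration only.

```python
def pick_request(queue, favor_opcode="load", oldest_first=True):
    """Pick one descriptor from a request queue under a combined policy:
    favored opcode first, then age (illustrative model only)."""
    candidates = [r for r in queue if r["valid"] and not r["sleep"]]
    if not candidates:
        return None

    def key(r):
        # False sorts before True, so requests with the favored opcode win.
        not_favored = r["opcode"] != favor_opcode
        # Negate the age to prefer the oldest request; keep it to prefer the newest.
        age_rank = -r["age"] if oldest_first else r["age"]
        return (not_favored, age_rank)

    return min(candidates, key=key)
```

Calling `pick_request` with `favor_opcode="load"` and `oldest_first=True` corresponds to the combined example in the text, in which loads are favored and older loads are favored further.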

[0123] In stage 372, the scheduler selects a single request from the following: 1) the selected first tier cache requests; 2) the selected snoop request from stage 370; and 3) Fill. In one embodiment, the scheduler gives Fill the highest priority, followed by Snoop, which is followed by the first tier cache requests. In one embodiment, the scheduler in cache 80 services the first tier cache requests on a round robin basis.
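
A minimal sketch of the stage 372 selection follows, assuming four compute engines and one pre-selected first tier request slot per engine. The class name and the tuple return format are hypothetical.

```python
class Stage372Scheduler:
    """Illustrative model: Fill first, then Snoop, then first tier requests
    serviced round robin across compute engines."""

    def __init__(self, num_engines=4):
        self.num_engines = num_engines
        self.next_engine = 0  # round robin pointer (assumed starting point)

    def select(self, fill, snoop, first_tier):
        if fill is not None:
            return ("fill", fill)
        if snoop is not None:
            return ("snoop", snoop)
        # Scan the engines starting from the round robin pointer.
        for i in range(self.num_engines):
            idx = (self.next_engine + i) % self.num_engines
            if first_tier[idx] is not None:
                self.next_engine = (idx + 1) % self.num_engines
                return ("first_tier", first_tier[idx])
        return None
```

Here `first_tier` is a list with one entry per compute engine, holding the request chosen for that engine in stage 370 or `None` when the engine has nothing pending.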

[0124] In stage 374, cache 80 determines whether it contains the memory location identified in the selected request from stage 372. If the selected request is Fill from data ring 20, cache 80 uses information from the header on data ring 20 to determine whether the cluster containing cache 80 is the target cluster for the data ring packet. Cache 80 examines the header's Cluster field to determine whether the Fill request corresponds to the cluster containing cache 80.

[0125] If any request other than Fill is selected in stage 372, cache 80 uses the Address field from the corresponding request descriptor to perform a tag lookup operation. In the tag lookup operation, cache 80 uses one set of bits in the request descriptor's Address field to identify a targeted set of ways. Cache 80 then compares another set of bits in the Address field to tags for the selected ways. If a tag match occurs, the requested memory location is in the cache 80 data array. Otherwise, there is a cache miss. In one such embodiment, cache 80 is a 64K 4-way set associative cache with a cache line size of 64 bytes.
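
The tag lookup can be worked through for the example geometry. The sketch assumes that "64K" means 64 kilobytes of data, which gives 64 × 1024 / (4 × 64) = 256 sets, a 6 bit line offset, and an 8 bit set index; the exact bit assignment in the embodiment is not stated, so the field layout below is an assumption.

```python
CACHE_BYTES = 64 * 1024  # assumed: "64K" read as 64 kilobytes
WAYS = 4
LINE_BYTES = 64

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)  # 256 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1      # 6 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1         # 8 bits select the set


def split_address(addr):
    """Split an address into (tag, set index, line offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def lookup(tags, addr):
    """tags[index] holds WAYS stored tags (None marks an invalid way).
    Returns the matching way number, or None on a miss. No data is read here,
    mirroring the lookup-before-data-access ordering described in the text."""
    tag, index, _ = split_address(addr)
    for way, stored in enumerate(tags[index]):
        if stored == tag:
            return way
    return None
```

For address 0x12345 this split gives tag 4, set index 141, and offset 5, since 4 × 16384 + 141 × 64 + 5 = 74565 = 0x12345.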

[0126] In one embodiment, as shown in FIG. 6, cache 80 performs the tag lookup or Cluster field comparison prior to reading any data from the data array in cache 80. This differs from a traditional multiple-way set associative cache. A traditional multiple-way cache reads a line of data from each addressed way at the same time a tag comparison is made. If there is not a match, the cache discards all retrieved data. If there is a match, the cache employs the retrieved data from the selected way. Simultaneously retrieving data from multiple ways consumes considerable amounts of both power and circuit area.

[0127] Conserving power and circuit area is an important consideration in manufacturing integrated circuits. In one embodiment, cache 80 is formed on a single integrated circuit. In another embodiment, MPU 10 is formed on a single integrated circuit. Performing the lookups before retrieving cache memory data makes cache 80 more suitable for inclusion on a single integrated circuit.

[0128] In stage 376, cache 80 responds to the cache address comparison performed in stage 374. Cache 80 contains read external request queue (“read ERQ”) 392 and write external request queue (“write ERQ”) 394 for responding to hits and misses detected in stage 374. Read ERQ 392 and write ERQ 394 allow cache 80 to forward memory access requests to global snoop controller 22 for further processing.

[0129] In one embodiment, read ERQ 392 contains 16 entries, with 2 entries reserved for each compute engine. Read ERQ 392 reserves entries, because excessive pre-fetch operations from one compute engine may otherwise consume the entire read ERQ. In one embodiment, write ERQ 394 includes 4 entries. Write ERQ 394 reserves one entry for requests that require global snoop controller 22 to issue snoop requests on snoop ring 21.
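
One way to picture the effect of the reserved entries is the allocation sketch below. With 4 compute engines and 2 reserved entries each, 8 of the 16 entries are reserved and 8 remain shared. The policy of consuming an engine's reserved entries before the shared pool, and the omission of entry release, are simplifying assumptions; the text only states that entries are reserved.

```python
class ReadERQ:
    """Illustrative allocation model for a read ERQ with per-engine reservations."""

    def __init__(self, total=16, engines=4, reserved_each=2):
        self.reserved_each = reserved_each
        self.shared_free = total - engines * reserved_each  # 8 shared entries
        self.used = [0] * engines  # entries currently held by each engine

    def try_allocate(self, engine):
        # Assumed policy: use the engine's own reserved entries first.
        if self.used[engine] < self.reserved_each:
            self.used[engine] += 1
            return True
        # Then fall back to the shared pool, if any entries remain.
        if self.shared_free > 0:
            self.shared_free -= 1
            self.used[engine] += 1
            return True
        return False
```

In this model a compute engine issuing many pre-fetches can hold at most 10 entries, so every other engine can still allocate its 2 reserved entries.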

[0130] Processing First Tier Request Hits: Once cache 80 detects an address match for a first tier load or store request, cache 80 accesses internal data array 396, which contains all the cached memory locations. The access results in data array 396 outputting a cache line containing the addressed memory location in stage 378. In one embodiment, the data array has a 64 byte cache line and is formed by 8 8K buffers, each having a data path 8 bytes wide. In such an embodiment, cache 80 accesses a cache line by addressing the same offset address in each of the 8 buffers.

[0131] An Error Correcting Code (“ECC”) check is performed on the retrieved cache line to check and correct any cache line errors. ECC is a well-known error detection and correction operation. The ECC operation overlaps between stages 378 and 380.

[0132] If the requested operation is a load, cache 80 supplies the cache line contents to first tier return buffer 391. First tier return buffer 391 is coupled to provide the cache line to the requesting first tier cache. In one embodiment of the present invention, cache 80 includes multiple first tier return buffers (not shown) for transferring data back to first tier caches. In one such embodiment, cache 80 includes 4 first tier return buffers.

[0133] If the requested operation is a store, cache 80 performs a read-modify-write operation. Cache 80 supplies the addressed cache line to store buffer 393 in stage 380. Cache 80 modifies the store buffer bytes addressed by the first tier memory request. Cache 80 then forwards the contents of the store buffer to data array 396. Cache 80 makes this transfer once cache 80 has an idle cycle or a predetermined period of time elapses. For stores, no data is returned to first tier data cache 52.

[0134] FIG. 7 illustrates the pipeline stage operations employed by cache 80 to transfer the cache line in a store buffer to data array 396 and first tier return buffer 391. This process occurs in parallel with the above-described pipeline stages. In stage 374, cache 80 selects between pending data array writes from store buffer 393 and data ring 20 via Fill requests. In one embodiment, Fill requests take priority. In one such embodiment, load accesses to data array 396 have priority over writes from store buffer 393. In alternate embodiments, different priorities are assigned.

[0135] In stage 376, cache 80 generates an ECC checksum for the data selected in stage 374. In stage 378, cache 80 stores the modified store buffer data in the cache line corresponding to the first tier request's Address field. Cache 80 performs an ECC check between stages 378 and 380. Cache 80 then passes the store buffer data to first tier return buffer 391 in stage 380 for return to the first tier cache.

[0136] If the hit request is a pre-fetch, cache 80 operates the same as explained above for a load.

[0137] Processing First Tier Request Misses: If the missed request's Opcode field calls for a non-cacheable load, cache 80 forwards the missed request's descriptor to read ERQ 392. Read ERQ 392 forwards the request descriptor to global snoop controller 22, which initiates retrieval of the requested data from main memory 26 by EBL 24.

[0138] If the missed request's Opcode field calls for a cacheable load, cache 80 performs as described above for a non-cacheable load with the following modifications. Global snoop controller 22 first initiates retrieval of the requested data from other clusters by issuing a snoop-share request on snoop ring 21. If the snoop request does not return the desired data, then global snoop controller 22 initiates retrieval from main memory 26 via EBL 24. Cache 80 also performs an eviction procedure. In the eviction procedure, cache 80 selects a location in the data array for a cache line of data containing the requested memory location. If the selected data array location contains data that has not been modified, cache 80 overwrites the selected location when the requested data is eventually returned on data ring 20.

[0139] If the selected data array location has been modified, cache 80 writes the cache line back to main memory 26 using write ERQ 394 and data ring 20. Cache 80 submits a request descriptor to write ERQ 394 in stage 376. The request descriptor is in the format of a first tier descriptor. Write ERQ 394 forwards the descriptor to global snoop controller 22. Snoop controller 22 instructs external bus logic 24 to capture the cache line off data ring 20 and transfer it to main memory 26. Global snoop controller 22 provides external bus logic 24 with descriptor information that enables logic 24 to recognize the cache line on data ring 20. In one embodiment, this descriptor includes the above-described information found in a snoop request descriptor.

[0140] Cache 80 accesses the selected cache line in data array 396, as described above, and forwards the line to data ring write buffer 395 in stages 376 through 380 (FIG. 6). Data ring write buffer 395 is coupled to provide the cache line on data ring 20. In one embodiment, cache 80 includes 4 data ring write buffers. Cache 80 sets the data ring header information for two 32 byte payload transfers as follows: 1) Validity: valid; 2) Cluster: External Bus Logic 24; 3) Memory Request Indicator: corresponding to the request sent to write ERQ 394; 4) MESI: Invalid; and 5) Transfer Done: set to “not done” for the first 32 byte transfer and “done” for the second 32 byte transfer. The header information enables EBL 24 to capture the cache line off data ring 20 and transfer it to main memory 26.
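
The two-transfer write-back described above can be sketched as follows. The dictionary keys mirror the header fields named in the text, while the function name, the `"EBL"` label, and the integer `request_id` are hypothetical stand-ins.

```python
def eviction_transfers(line, request_id):
    """Build the two data ring transfers for writing back a 64 byte cache line
    (illustrative model; field encodings are assumptions)."""
    if len(line) != 64:
        raise ValueError("expected a 64 byte cache line")
    halves = [line[:32], line[32:]]
    transfers = []
    for i, payload in enumerate(halves):
        transfers.append({
            "validity": True,
            "cluster": "EBL",              # targets external bus logic 24
            "memory_request": request_id,  # matches the write ERQ entry
            "mesi": "Invalid",
            # "not done" on the first transfer, "done" on the last one
            "transfer_done": i == len(halves) - 1,
            "payload": payload,
        })
    return transfers
```

The receiver in this model can reassemble the cache line by concatenating payloads until it sees a header with Transfer Done set.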

[0141] Cache 80 performs an extra operation if a store has been performed on the evicted cache line and the store buffer data has not been written to the data array 396. In this instance, cache 80 utilizes the data selection circuitry from stage 380 (FIG. 7) to transfer the data directly from store buffer 393 to data ring write buffer 395.

[0142] If the missed request's Opcode field calls for a non-cacheable store, cache 80 forwards the request to write ERQ 394 in stage 376 for submission to global snoop controller 22. Global snoop controller 22 provides a main memory write request to external bus logic 24, as described above. In stage 378 (FIG. 7), cache controller 80 selects the data from the non-cacheable store operation. In stage 380, cache 80 forwards the data to data ring write buffer 395. Cache 80 sets the data ring header as follows for two 32 byte payload transfers: 1) Validity: valid; 2) Cluster: External Bus Logic 24; 3) Memory Request: corresponding to the request sent to write ERQ 394; 4) MESI: Invalid; and 5) Transfer Done: set to “not done” for the first 32 byte transfer and “done” for the second 32 byte transfer.

[0143] If the missed request's Opcode field calls for a cacheable store, cache 80 performs the same operation as explained above for a missed cacheable load. This is because cache 80 performs stores using a read-modify-write operation. In one embodiment, snoop controller 22 issues a snoop-own request in response to the read ERQ descriptor for cache 80.

[0144] If the missed request's Opcode field calls for a pre-fetch, cache 80 performs the same operation as explained above for a missed cacheable load.

[0145] Processing First Tier Requests for Store-Create Operations: When a request's Opcode field calls for a store-create operation, cache 80 performs an address match in stage 374. If there is not a match, cache 80 forwards the request to global snoop controller 22 through read ERQ 392 in stage 376. Global snoop controller 22 responds by issuing a snoop-kill request on snoop ring 21. The snoop-kill request instructs all other clusters to relinquish control of the identified memory location. Second tier cache responses to snoop-kill requests will be explained below.

[0146] If cache 80 discovers an address match in stage 374, cache 80 determines whether the matching cache line has an Exclusive or Modified MESI state. In either of these cases, cache 80 takes no further action. If the status is Shared, then cache 80 forwards the request to snoop controller 22 as described above for the non-matching case.
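
The store-create handling reduces to a small decision, sketched below. The function name and the returned action strings are hypothetical labels chosen for the example.

```python
def store_create_action(hit, mesi=None):
    """Decide what the second tier cache does for a store-create request
    (illustrative model of the cases described in the text)."""
    if hit and mesi in ("Exclusive", "Modified"):
        # The cache already owns the line outright: nothing more to do.
        return "no_action"
    # On a miss, or a hit on a Shared line, other clusters must give up the line.
    return "forward_for_snoop_kill"
```

Both a miss and a hit on a Shared line follow the same path here, matching the statement that the Shared case is handled as described for the non-matching case.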

[0147] Processing Snoop Request Hits: If the snoop request Opcode field calls for an own operation, cache 80 relinquishes ownership of the addressed cache line and transfers the line's contents onto data ring 20. Prior to transferring the cache line, cache 80 updates the line, if necessary.

[0148] Cache 80 accesses data array 396 in stage 378 (FIG. 6) to retrieve the contents of the cache line containing the desired data; the Address field in the snoop request descriptor identifies the desired cache line. This access operates the same as described above for first tier cacheable load hits. Cache 80 performs ECC checking and correction in stages 378 and 380 and writes the cache line to data ring write buffer 395. Alternatively, if the retrieved cache line buffer needs to be updated, cache 80 transfers the contents of store buffer 393 to data ring write buffer 395 (FIG. 7).

[0149] Cache 80 provides the following header information to the data ring write buffer along with the cache line: 1) Validity: valid; 2) Cluster: same as in the snoop request; 3) Memory Request: same as in the snoop request; 4) MESI: Exclusive (if the data was never modified while in cache 80) or Modified (if the data was modified while in cache 80); and 5) Transfer Done: “not done”, except for the header connected with the final payload for the cache line. Cache 80 then transfers the contents of data ring write buffer 395 onto data ring 20.

[0150] Cache 80 also provides global snoop controller 22 with an acknowledgement that cache 80 serviced the snoop request. In one embodiment, cache 80 performs the acknowledgement via the point-to-point link with snoop controller 22.

[0151] If the snoop request Opcode field calls for a share operation, cache 80 performs the same as described above for a read operation with the following exceptions. Cache 80 does not necessarily relinquish ownership. Cache 80 sets the MESI field to Shared if the requested cache line's current MESI status is Exclusive or Shared. However, if the current MESI status for the requested cache line is Modified, then cache 80 sets the MESI data ring field to Modified and relinquishes ownership of the cache line. Cache 80 also provides global snoop controller 22 with an acknowledgement that cache 80 serviced the snoop request, as described above.

[0152] If the snoop request Opcode field calls for a kill operation, cache 80 relinquishes ownership of the addressed cache line and does not transfer the line's contents onto data ring 20. Cache 80 also provides global snoop controller 22 with an acknowledgement that cache 80 serviced the snoop request, as described above.
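
The three snoop hit cases can be summarized in one function. This is a sketch under assumptions: the tuple layout is invented for the example, and retaining ownership on a share request when the line is Exclusive or Shared is an interpretation of "does not necessarily relinquish ownership".

```python
def snoop_hit_response(opcode, mesi):
    """Return (header_mesi, transfers_data, keeps_ownership) for a snoop hit.
    Illustrative model only; every case is followed by an acknowledgement
    to the global snoop controller, which is not modeled here."""
    if opcode == "kill":
        # Ownership is released and no data is placed on the data ring.
        return (None, False, False)
    if opcode == "own":
        header = "Modified" if mesi == "Modified" else "Exclusive"
        return (header, True, False)
    if opcode == "share":
        if mesi == "Modified":
            return ("Modified", True, False)
        # Exclusive or Shared: data is sent with a Shared header.
        return ("Shared", True, True)
    raise ValueError("unknown snoop opcode: " + str(opcode))
```
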

[0153] Processing Snoop Request Misses: If the snoop request is a miss, cache 80 merely provides an acknowledgement to global snoop controller 22 that cache 80 serviced the snoop request.

[0154] Processing Fill Requests With Cluster Matches: If a Fill request has a cluster match, cache 80 retrieves the original request that led to the incoming data ring Fill request. The original request is contained in either read ERQ 392 or write ERQ 394. The Memory Request field from the incoming data ring header identifies the corresponding entry in read ERQ 392 or write ERQ 394. Cache 80 employs the Address and Opcode fields from the original request in performing further processing.

[0155] If the original request's Opcode field calls for a cacheable load, cache 80 transfers the incoming data ring payload data into data array 396 and first tier return buffer 391. In stage 374 (FIG. 7), cache 80 selects the Fill Data, which is the payload from data ring 20. In stage 376, cache 80 performs ECC generation. In stage 378, cache 80 accesses data array 396 and writes the Fill Data into the addressed cache line. Cache 80 performs the data array access based on the Address field in the original request descriptor. As explained above, cache 80 previously assigned the address in the Address field a location in data array 396 before forwarding the original request to global snoop controller 22. The data array access also places the Fill Data into first tier return buffer 391. Cache 80 performs ECC checking in stages 378 and 380 and loads first tier return buffer 391.

[0156] If the original request's Opcode field calls for a non-cacheable load, cache 80 selects Fill Data in stage 378 (FIG. 7). Cache 80 then forwards the Fill Data to first tier return buffer 391 in stage 380. First tier return buffer 391 passes the payload data back to the first tier cache requesting the load. If the original request's Opcode field calls for a cacheable store, cache 80 responds as follows in one embodiment. First, cache 80 places the Fill Data in data array 396; cache 80 performs the same operations described above for a response to a cacheable load Fill request. Next, cache 80 performs a store using the data originally supplied by the requesting compute engine; cache 80 performs the same operations as described above for a response to a cacheable store first tier request with a hit.

[0157] In an alternate embodiment, cache 80 stores the data originally provided by the requesting compute engine in store buffer 393. Cache 80 then compares the store buffer data with the Fill Data, modifying store buffer 393 to include Fill Data in bit positions not targeted for new data storage in the store request. Cache 80 writes the contents of store buffer 393 to data array 396 when there is an idle cycle or another access to store buffer 393 is necessary, whichever occurs first.

[0158] If the original request's Opcode field calls for a pre-fetch, cache 80 responds the same as for a cacheable load Fill request.

[0159] Processing Fill Requests Without Cluster Matches: If a Fill request does not have a cluster match, cache 80 merely places the incoming data ring header and payload back onto data ring 20.

[0160] Cache 80 also manages snoop request queue 390 and data cache request queues 382, 384, 386, and 388. Once a request from snoop request queue 390 or data cache request queue 382, 384, 386, or 388 is sent to read ERQ 392 or write ERQ 394, cache 80 invalidates the request to make room for more requests. Once a read ERQ request or write ERQ request is serviced, cache 80 removes the request from the ERQ. Cache 80 removes a request by setting the request's Validity field to an invalid status.

[0161] In one embodiment, cache 80 also includes a sleep mode to aid in queue management. Cache 80 employs sleep mode when either read ERQ 392 or write ERQ 394 is full and cannot accept another request from a first tier data cache request queue or snoop request queue. Instead of refusing service to a request or flushing the cache pipeline, cache 80 places the first tier or snoop request in a sleep mode by setting the Sleep field in the request descriptor. When read ERQ 392 or write ERQ 394 can service the request, cache 80 removes the request from sleep mode and allows it to be reissued in the pipeline.

[0162] In another embodiment of the invention, the scheduler in cache 80 filters the order of servicing first tier data cache requests to ensure that data is not corrupted. For example, CPU 60 may issue a load instruction for a memory location, followed by a store for the same location. The load needs to occur first to avoid loading improper data. Due to either the CPU's pipeline or a reprioritization by cache 80, the order of the load and store commands in the above example can become reversed.

[0163] Processors traditionally resolve the dilemma in the above example by issuing no instructions until the load in the above example is completed. This solution, however, has the drawback of slowing processing speed, because instruction cycles go by without the CPU executing any instructions.

[0164] In one embodiment of the present invention, the prioritization filter of cache 80 overcomes the drawback of the traditional processor solution. Cache 80 allows memory requests to be reordered, but no request is allowed to precede another request upon which it is dependent. For example, a set of requests calls for a load from location A, a store to location A after the load from A, and a load from memory location B. The store to A is dependent on the load from A being performed first. Otherwise, the store to A corrupts the load from A. The load from A and load from B are not dependent on other instructions preceding them. Cache 80 allows the load from A and load from B to be performed in any order, but the store to A is not allowed to proceed until the load from A is complete. This allows cache 80 to service the load from B, while waiting for the load from A to complete. No processing time needs to go idle.

[0165] Cache 80 implements the prioritization filter using read ERQ 392, write ERQ 394, and the Dependency field in a first tier data cache request descriptor. The Dependency field identifies requests in the first tier data cache request queue that must precede the dependent request. Cache 80 does not select the dependent request from the data cache request queue until all the requests on which it depends have been serviced. Cache 80 recognizes a request as serviced once the request's Validity field is set to an invalid state, as described above.
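
The dependency filter, together with the Validity and Sleep fields, can be sketched with a small descriptor type. The `Request` class and the `eligible` function are hypothetical; in particular, representing the Dependency field as a list of references to earlier requests is an assumption about its encoding.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical subset of a first tier request descriptor."""
    name: str
    valid: bool = True    # cleared once the request has been serviced
    sleep: bool = False   # set while the target ERQ is full
    depends_on: list = field(default_factory=list)  # requests that must go first


def eligible(queue):
    """Names of requests the scheduler may select right now."""
    return [
        r.name
        for r in queue
        if r.valid
        and not r.sleep
        # A dependency counts as serviced once its Validity field is invalid.
        and not any(dep.valid for dep in r.depends_on)
    ]
```

Using the example from the text, a queue holding a load from A, a store to A that depends on that load, and a load from B initially exposes only the two loads; the store becomes eligible once the load from A is marked invalid.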

[0166] C. Global Snoop Controller

[0167] Global snoop controller 22 responds to requests issued by clusters 12, 14, 16, and 18. As demonstrated above, these requests come from read ERQ and write ERQ buffers in second tier caches. The requests instruct global snoop controller 22 to either issue a snoop request or an access to main memory. Additionally, snoop controller 22 converts an own or share snoop request into a main memory access request to EBL 24 when no cluster performs a requested memory transfer. Snoop controller 22 uses the above-described acknowledgements provided by the clusters' second tier caches to keep track of memory transfers performed by clusters.

[0168] D. Application Processing

[0169] FIG. 8a illustrates a process employed by MPU 10 for executing applications in one embodiment of the present invention. FIG. 8a illustrates a process in which MPU 10 is employed in an application-based router in a communications network. Generally, an application-based router identifies and executes applications that need to be performed on data packets received from a communications medium. Once the applications are performed for a packet, the router determines the next network destination for the packet and transfers the packet over the communications medium. MPU 10 receives a data packet from a communications medium coupled to MPU 10 (step 130). In one embodiment, MPU 10 is coupled to an IEEE 802.3 compliant network running Gigabit Ethernet. In other embodiments, MPU 10 is coupled to different networks and in some instances operates as a component in a wide area network. A compute engine in MPU 10, such as compute engine 50 in FIG. 4, is responsible for receiving packets. In such an embodiment, coprocessor 62 includes application specific circuitry coupled to the communications medium for receiving packets. Coprocessor 62 also includes application specific circuitry for storing the packets in data cache 52 and second tier cache 80. The reception process and related coprocessor circuitry will be described below in greater detail.

[0170] Compute engine 50 transfers ownership of received packets to a flow control compute engine, such as compute engine 86, 88, or 90 in FIG. 4 (step 132). Compute engine 50 transfers packet ownership by placing an entry in the application queue of the flow control compute engine.

[0171] The flow control compute engine forwards ownership of each packet to a compute engine in a pipeline set of compute engines (step 134). The pipeline set of compute engines is a set of compute engines that will combine to perform applications required for the forwarded packet. The flow control compute engine determines the appropriate pipeline by examining the packet to identify the applications to be performed. The flow control compute engine transfers ownership to a pipeline capable of performing the required applications.

[0172] In one embodiment of the present invention, the flow control compute engine uses the projected speed of processing applications as a consideration in selecting a pipeline. Some packets require significantly more processing than others. A limited number of pipelines are designated to receive such packets, in order to avoid these packets consuming all of the MPU processing resources.

[0173] After the flow control compute engine assigns the packet to a pipeline (step 134), a pipeline compute engine performs a required application for the assigned packet (step 136). Once the application is completed, the pipeline compute engine determines whether any applications still need to be performed (step 138). If more applications remain, the pipeline compute engine forwards ownership of the packet to another compute engine in the pipeline (step 134) and the above-described process is repeated. This enables multiple services to be performed by a single MPU. If no applications remain, the pipeline compute engine forwards ownership of the packet to a transmit compute engine (step 140).

[0174] The transmit compute engine transmits the data packet to a new destination of the network, via the communications medium (step 142). In one such embodiment, the transmit compute engine includes a coprocessor with application specific circuitry for transmitting packets. The coprocessor also includes application specific circuitry for retrieving the packets from memory. The transmission process and related coprocessor circuitry will be described below in greater detail.
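
The flow of steps 130 through 142 can be summarized in a brief software sketch. It is a simplified illustration only: the function `process_packet`, the dictionary of pipelines, and the application names are hypothetical, and the sketch assumes the flow control engine simply picks the first pipeline able to run every required application.

```python
def process_packet(required_apps, pipelines):
    """Return the ordered (step number, action) trace for one packet, per FIG. 8a."""
    trace = [(130, "receive packet"), (132, "hand to flow control engine")]
    # Assumption: flow control picks the first pipeline that supports every application.
    pipeline = next(
        (name for name, apps in pipelines.items() if set(required_apps) <= set(apps)),
        None,
    )
    if pipeline is None:
        raise ValueError("no pipeline supports the required applications")
    remaining = list(required_apps)
    while remaining:                       # step 138: do any applications remain?
        trace.append((134, f"forward to engine in {pipeline}"))
        trace.append((136, f"perform {remaining.pop(0)}"))
    trace.append((140, "forward to transmit engine"))
    trace.append((142, "transmit packet"))
    return trace

trace = process_packet(
    ["hash", "encrypt"],
    {"p0": ["copy"], "p1": ["hash", "encrypt", "search"]},
)
```

Each pass through the loop corresponds to one ownership transfer (step 134) followed by one application (step 136); the loop exits when the check of step 138 finds nothing left to perform.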

[0175] FIG. 8b illustrates a process for executing applications in an alternate embodiment of the present invention. This embodiment employs multiple multi-processor units, such as MPU 10. In this embodiment, the multi-processor units are coupled together over a communications medium. In one version, the multi-processor units are coupled together by cross-bar switches, such as the cross-bar switch disclosed in U.S. patent application Ser. No. ______ entitled Cross-Bar Switch, filed on Jul. 6, 2001, having Attorney Docket No. NEXSI-01022US0, and hereby incorporated by reference.

[0176] In the embodiment shown in FIG. 8b, steps with the same reference numbers as steps in FIG. 8a operate as described for FIG. 8a. The difference is that packets are assigned to a pipeline set of multi-processor units, instead of a pipeline set of compute engines. Each multi-processor unit in a pipeline transfers packets to the next multi-processor unit in the pipeline via the communications medium (step 133). In one such embodiment, each multi-processor unit has a compute engine coprocessor with specialized circuitry for performing communications medium receptions and transmissions, as well as exchanging data with cache memory. In one version of the FIG. 8b process, each multi-processor unit performs a dedicated application. In alternate embodiments, a multi-processor unit performs multiple applications.

[0177] Although MPU 10 has been described above with reference to a router application, MPU 10 can be employed in many other applications. One example is video processing. In such an application, packet reception step 130 is replaced with a different operation that assigns video processing applications to MPU 10. Similarly, packet transmission step 142 is replaced with an operation that delivers processed video data.

[0178] E. Coprocessor

[0179] As described above, MPU 10 employs coprocessors in cluster compute engines to expedite application processing. The following sets forth coprocessor implementations employed in one set of embodiments of the present invention. One of ordinary skill will recognize that alternate coprocessor implementations can also be employed in an MPU in accordance with the present invention.

[0180] 1. Coprocessor Architecture and Operation

[0181] FIG. 9a illustrates a coprocessor in one embodiment of the present invention, such as coprocessor 62 from FIGS. 3 and 4. Coprocessor 62 includes sequencers 150 and 152, each coupled to CPU 60, arbiter 176, and a set of application engines. The application engines coupled to sequencer 150 include streaming input engine 154, streaming output engine 162, and other application engines 156, 158, and 160. The application engines coupled to sequencer 152 include streaming input engine 164, streaming output engine 172, and other application engines 166, 168, and 170. In alternate embodiments any number of application engines are coupled to sequencers 150 and 152.

[0182] Sequencers 150 and 152 direct the operation of their respective coupled engines in response to instructions received from CPU 60. In one embodiment, sequencers 150 and 152 are micro-code based sequencers, executing micro-code routines in response to instructions from CPU 60. Sequencers 150 and 152 provide output signals and instructions that control their respectively coupled engines in response to these routines. Sequencers 150 and 152 also respond to signals and data provided by their respectively coupled engines. Sequencers 150 and 152 additionally perform application processing internally in response to CPU 60 instructions.

[0183] Streaming input engines 154 and 164 each couple coprocessor 62 to data cache 52 for retrieving data. Streaming output engines 162 and 172 each couple coprocessor 62 to data cache 52 for storing data to memory. Arbiter 176 couples streaming input engines 154 and 164, streaming output engines 162 and 172, and sequencers 150 and 152 to data cache 52. In one embodiment, arbiter 176 receives and multiplexes the data paths for the entities on coprocessor 62. Arbiter 176 ensures that only one entity at a time receives access to the interface lines between coprocessor 62 and data cache 52. Micro-MMU 174 is coupled to arbiter 176 to provide internal conversions between virtual and physical addresses. In one embodiment of the present invention, arbiter 176 performs a round-robin arbitration scheme. Micro-MMU 174 contains the above-referenced internal translation buffers for coprocessor 62 and provides coprocessor 62's interface to MMU 58 (FIG. 3) or 82 (FIG. 4).

[0184] Application engines 156, 158, 160, 166, 168, and 170 each perform a data processing application relevant to the job being performed by MPU 10. For example, when MPU 10 is employed in one embodiment as an application-based router, application engines 156, 158, 160, 166, 168, and 170 each perform one of the following: 1) data string copies; 2) polynomial hashing; 3) pattern searching; 4) RSA modulo exponentiation; 5) receiving data packets from a communications medium; 6) transmitting data packets onto a communications medium; and 7) data encryption and decryption.

[0185] Application engines 156, 158, and 160 are coupled to provide data to streaming output engine 162 and receive data from streaming input engine 154. Application engines 166, 168, and 170 are coupled to provide data to streaming output engine 172 and receive data from streaming input engine 164.

[0186] FIG. 9b shows an embodiment of coprocessor 62 with application engines 156 and 166 designed to perform the data string copy application. In this embodiment, engines 156 and 166 are coupled to provide string copy output data to engine sets 158, 160, and 162, and 168, 170, and 172, respectively. FIG. 9c shows an embodiment of coprocessor 62, where engine 160 is a transmission media access controller (“TxMAC”) and engine 170 is a reception media access controller (“RxMAC”). TxMAC 160 transmits packets onto a communications medium, and RxMAC 170 receives packets from a communications medium. These two engines will be described in greater detail below.

[0187] One advantage of the embodiment of coprocessor 62 shown in FIGS. 9a-9c is its modularity. Coprocessor 62 can easily be customized to accommodate many different applications. For example, in one embodiment only one compute engine receives and transmits network packets. In this case, only one coprocessor contains an RxMAC and TxMAC, while other coprocessors in MPU 10 are customized with different data processing applications. Coprocessor 62 supports modularity by providing a uniform interface to application engines, except streaming input engines 154 and 164 and streaming output engines 162 and 172.

[0188] 2. Sequencer

[0189] FIG. 10 shows an interface between CPU 60 and sequencers 150 and 152 in coprocessor 62 in one embodiment of the present invention. CPU 60 communicates with sequencers 150 and 152 through data registers 180 and 184, respectively, and control registers 182 and 186, respectively. CPU 60 has address lines and data lines coupled to the above-listed registers. Data registers 180 and control registers 182 are each coupled to exchange information with micro-code engine and logic block 188. Block 188 interfaces to the engines in coprocessor 62. Data registers 184 and control registers 186 are each coupled to exchange information with micro-code engine and logic block 190. Block 190 interfaces to the engines in coprocessor 62.

[0190] CPU 60 is coupled to exchange the following signals with sequencers 150 and 152: 1) Interrupt (INT): outputs from sequencers 150 and 152 indicating an assigned application is complete; 2) Read Allowed: outputs from sequencers 150 and 152 indicating access to data and control registers is permissible; 3) Running: outputs from sequencers 150 and 152 indicating that an assigned application is running; 4) Start: outputs from CPU 60 indicating that sequencer operation is to begin; and 5) Opcode: outputs from CPU 60 identifying the set of micro-code instructions for the sequencer to execute after the assertion of Start.

[0191] In operation, CPU 60 offloads performance of assigned applications to coprocessor 62. CPU 60 instructs sequencers 150 and 152 by writing instructions and data into respective data registers 180 and 184 and control registers 182 and 186. The instructions forwarded by CPU 60 prompt either sequencer 150 or sequencer 152 to begin executing a routine in the sequencer's micro-code. The executing sequencer performs the application either by running a micro-code routine or by instructing an application engine to perform the offloaded application. While the application is running, the sequencer asserts the Running signal, and when the application is done the sequencer asserts the Interrupt signal. This allows CPU 60 to detect and respond to an application's completion either by polling the Running signal or employing interrupt service routines.
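
The handshake just described can be pictured with a toy software model. This sketch is an illustration under stated assumptions rather than the register-level interface of FIG. 10: the class `SequencerModel`, its single `data_register`, and the one-step `clock` method are hypothetical simplifications, and a real micro-code routine would run for many cycles.

```python
class SequencerModel:
    """Toy model of the Start / Opcode / Running / Interrupt handshake."""

    def __init__(self, routines):
        self.routines = routines      # opcode -> callable standing in for a micro-code routine
        self.running = False          # models the Running output
        self.interrupt = False        # models the Interrupt output
        self.data_register = None
        self._opcode = None

    def start(self, opcode, data):
        """CPU side: write the data register, present Opcode, and assert Start."""
        self.data_register = data
        self._opcode = opcode
        self.interrupt = False
        self.running = True

    def clock(self):
        """Sequencer side: run the selected routine to completion in one step."""
        if self.running:
            self.data_register = self.routines[self._opcode](self.data_register)
            self.running = False
            self.interrupt = True

def offload(seq, opcode, data):
    """CPU polls Running until the sequencer reports completion."""
    seq.start(opcode, data)
    polls = 0
    while seq.running:
        seq.clock()
        polls += 1
    return seq.data_register, polls

seq = SequencerModel({0x1: lambda payload: payload[::-1]})
result, polls = offload(seq, 0x1, b"abc")
```

The polling loop corresponds to the first completion-detection option in the text; an interrupt-driven CPU would instead react to the `interrupt` flag.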

[0192] FIG. 11 shows an interface between sequencer 150 and its related application engines in one embodiment of the present invention. The same interface is employed for sequencer 152.

[0193] Output data interface 200 and input data interface 202 of sequencer 150 are coupled to engines 156, 158, and 160. Output data interface 200 provides data to engines 156, 158, and 160, and input data interface 202 retrieves data from engines 156, 158, and 160. In one embodiment, data interfaces 200 and 202 are each 32 bits wide.

[0194] Sequencer 150 provides enable output 204 to engines 156, 158, and 160. Enable output 204 indicates which application engine is activated. In one embodiment of the present invention, sequencer 150 only activates one application engine at a time. In such an embodiment, application engines 156, 158, and 160 each receive a single bit of enable output 204, and assertion of that bit indicates the receiving application engine is activated. In alternate embodiments, multiple application engines are activated at the same time.

[0195] Sequencer 150 also includes control interface 206 coupled to application engines 156, 158, and 160. Control interface 206 manages the exchange of data between sequencer 150 and application engines 156, 158, and 160. Control interface 206 supplies the following signals:

[0196] 1) register read enable: enabling data and control registers on the activated application engine to supply data on input data interface 202;

[0197] 2) register write enable: enabling data and control registers on the activated application engine to accept data on output data interface 200;

[0198] 3) register address lines: providing addresses to application engine registers in conjunction with the data and control register enable signals; and

[0199] 4) arbitrary control signals: providing unique interface signals for each application engine. The sequencer's micro-code programs the arbitrary control bits to operate differently with each application engine to satisfy each engine's unique interface needs.

[0200] Once sequencer 150 receives instruction from CPU 60 to carry out an application, sequencer 150 begins executing the micro-code routine supporting that application. In some instances, the micro-code instructions carry out the application without using any application engines. In other instances, the micro-code instructions cause sequencer 150 to employ one or more application engines to carry out an application.

[0201] When sequencer 150 employs an application engine, the micro-code instructions cause sequencer 150 to issue an enable signal to the engine on enable output 204. Following the enable signal, the micro-code directs sequencer 150 to use control interface 206 to initialize and direct the operation of the application engine. Sequencer 150 provides control directions by writing the application engine's control registers and provides necessary data by writing the application engine's data registers. The micro-code also instructs sequencer 150 to retrieve application data from the application engine. An example of the sequencer-application interface will be presented below in the description of RxMAC 170 and TxMAC 160.

[0202] Sequencer 150 also includes a streaming input (SI) engine interface 208 and streaming output (SO) engine interface 212. These interfaces couple sequencer 150 to streaming input engine 154 and streaming output engine 162. The operation of these interfaces will be explained in greater detail below.

[0203] Streaming input data bus 210 is coupled to sequencer 150, streaming input engine 154, and application engines 156, 158, and 160. Streaming input engine 154 drives bus 210 after retrieving data from memory. In one embodiment, bus 210 is 16 bytes wide. In one such embodiment, sequencer 150 is coupled to retrieve only 4 bytes from data bus 210.

[0204] Streaming output bus 211 is coupled to sequencer 150, streaming output engine 162, and application engines 156, 158, and 160. Application engines deliver data to streaming output engine 162 over streaming output bus 211, so streaming output engine 162 can buffer the data to memory. In one embodiment, bus 211 is 16 bytes wide. In one such embodiment, sequencer 150 only drives 4 bytes on data bus 211.

[0205] 3. Streaming Input Engine

[0206] FIG. 12 shows streaming input engine 154 in one embodiment of the present invention. Streaming input engine 154 retrieves data from memory in MPU 10 at the direction of sequencer 150. Sequencer 150 provides streaming input engine 154 with a start address and data size value for the block of memory to be retrieved. Streaming input engine 154 responds by retrieving the identified block of memory and providing it on streaming data bus 210 in coprocessor 62. Streaming input engine 154 provides data in programmable word sizes on bus 210, in response to signals on SI control interface 208.

[0207] Fetch and pre-fetch engine 226 provides instructions (Memory Opcode) and addresses for retrieving data from memory. Alignment circuit 228 receives the addressed data and converts the format of the data into the alignment desired on streaming data bus 210. In one embodiment, engine 226 and alignment circuit 228 are coupled to first tier data cache 52 through arbiter 176 (FIGS. 9a-9c).

[0208] Alignment circuit 228 provides the realigned data to register 230, which forwards the data to data bus 210. Mask register 232 provides a mask value identifying the output bytes of register 230 that are valid. In one embodiment, fetch engine 226 addresses 16 byte words in memory, and streaming input engine 154 can be programmed to provide words with sizes of 0, 1, 2, 3, 4, 5, 6, 7, 8, or 16 bytes.

[0209] Streaming input engine 154 includes configuration registers 220, 222, and 224 for receiving configuration data from sequencer 150. Registers 220, 222, and 224 are coupled to data signals on SI control interface 208 to receive a start address, data size, and mode identifier, respectively. Registers 220, 222, and 224 are also coupled to receive the following control strobes from sequencer 150 via SI control interface 208: 1) start address strobe: coupled to start address register 220; 2) data size strobe: coupled to data size register 222; and 3) mode strobe: coupled to mode register 224. Registers 220, 222, and 224 each capture the data on output data interface 200 when sequencer 150 asserts their respective strobes.

[0210] In operation, fetch engine 226 fetches the number of bytes identified in data size register 222, beginning at the start address in register 220. In one embodiment, fetch engine 226 includes a pre-fetch operation to increase the efficiency of memory fetches. Fetch engine 226 issues pre-fetch instructions prior to addressing memory. In response to the pre-fetch instructions, MPU 10 begins the process of mapping the memory block being accessed by fetch engine 226 into data cache 52 (see FIGS. 3 and 4).

[0211] In one embodiment, fetch engine 226 calls for MPU 10 to pre-fetch the first three 64 byte cache lines of the desired memory block. Next, fetch engine 226 issues load instructions for the first 64 byte cache line of the desired memory block. Before each subsequent load instruction for the desired memory block, fetch engine 226 issues pre-fetch instructions for the two cache lines following the previously pre-fetched lines. If the desired memory block is less than three cache lines, fetch engine 226 only issues pre-fetch instructions for the number of lines being sought. Ideally, the pre-fetch operations will result in data being available in data cache 52 when fetch engine 226 issues load instructions.

[0212] SI control interface 208 includes the following additional signals: 1) abort: asserted by sequencer 150 to halt a memory retrieval operation; 2) start: asserted by sequencer 150 to begin a memory retrieval operation; 3) done: asserted by streaming input engine 154 when the streaming input engine is drained of all valid data; 4) Data Valid: asserted by streaming input engine 154 to indicate engine 154 is providing valid data on data bus 210; 5) 16 Byte Size & Advance: asserted by sequencer 150 to call for a 16 byte data output on data bus 210; and 6) 9 Byte Size & Advance: asserted by sequencer 150 to call for either a 0, 1, 2, 3, 4, 5, 6, 7, or 8 byte data output on data bus 210.

[0213] In one embodiment, alignment circuit 228 includes buffer 234, byte selector 238, register 236, and shifter 240. Buffer 234 is coupled to receive 16 byte data words from data cache 52 through arbiter 176. Buffer 234 supplies data words on its output in the order the data words were received. Register 236 is coupled to receive 16 byte data words from buffer 234. Register 236 stores the data word that resided on the output of buffer 234 immediately before the word currently on that output.

[0214] Byte selector 238 is coupled to receive the data word stored in register 236 and the data word on the output of buffer 234. Byte selector 238 converts the 32 byte input into a 24 byte output, which is coupled to shifter 240. The 24 bytes follow the byte last provided to register 230. Register 236 loads the output of buffer 234 and buffer 234 outputs the next 16 bytes, when the 24 bytes extend beyond the most significant byte on the output of buffer 234. Shifter 240 shifts the 24 byte input, so the next set of bytes to be supplied on data bus 210 appear on the least significant bytes of the output of shifter 240. The output of shifter 240 is coupled to register 230, which transfers the output of shifter 240 onto data bus 210.

[0215] Shifter 240 is coupled to supply the contents of mask 232 and receive the 9 Byte Size & Advance signal. The 9 Byte Size & Advance signal indicates the number of bytes to provide in register 230 for transfer onto streaming data bus 210. The 9 Byte Size & Advance signal covers a range of 0 to 8 bytes. When the advance bit of the signal is deasserted, the entire signal is ignored. Using the contents of the 9 Byte Size & Advance signal, shifter 240 properly aligns data in register 230 so the desired number of bytes for the next data transfer appear in register 230 starting at the least significant byte.

[0216] The 16 Byte Size & Advance signal is coupled to buffer 234 and byte selector 238 to indicate that a 16 byte transfer is required on data bus 210. In response to this signal, buffer 234 immediately outputs the next 16 bytes, and register 236 latches the bytes previously on the output of buffer 234. When the advance bit of the signal is deasserted, the entire signal is ignored.

[0217] In one embodiment, mode register 224 stores two mode bits. The first bit controls the assertion of the data valid signal. If the first bit is set, streaming input engine 154 asserts the data valid signal once there is valid data in buffer 234. If the first bit is not set, streaming input engine 154 waits until buffer 234 contains at least 32 valid bytes before asserting data valid. The second bit controls the deassertion of the data valid signal. When the second bit is set, engine 154 deasserts data valid when the last byte of data leaves buffer 234. Otherwise, engine 154 deasserts data valid when buffer 234 contains less than 16 valid data bytes.
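
The effect of the two mode bits can be expressed as a small next-state function. The following is a minimal sketch, assuming the data valid signal depends only on its current state and the number of valid bytes in buffer 234; the function and parameter names are hypothetical.

```python
def next_data_valid(valid_now, buffered_bytes, first_bit, second_bit):
    """Return the next state of the data valid signal.

    first_bit:  set -> assert as soon as any valid byte is buffered,
                clear -> wait for at least 32 valid bytes.
    second_bit: set -> stay asserted until the last byte leaves the buffer,
                clear -> deassert once fewer than 16 valid bytes remain.
    """
    if not valid_now:
        assert_threshold = 1 if first_bit else 32
        return buffered_bytes >= assert_threshold
    hold_threshold = 1 if second_bit else 16
    return buffered_bytes >= hold_threshold

# With both bits clear the signal shows hysteresis: on at 32 bytes, off below 16.
rises_at_32 = next_data_valid(False, 32, first_bit=False, second_bit=False)
holds_at_16 = next_data_valid(True, 16, first_bit=False, second_bit=False)
drops_at_15 = next_data_valid(True, 15, first_bit=False, second_bit=False)
```

With both bits clear, the signal turns on only at 32 buffered bytes and turns off below 16; setting both bits makes it track whether the buffer holds any valid data at all.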

[0218] 4. Streaming Output Engine

[0219] FIG. 13 illustrates one embodiment of streaming output engine 162 in coprocessor 62. Streaming output engine 162 receives data from streaming data bus 211 and stores the data in memory in MPU 10. Streaming data bus 211 provides data to alignment block 258 and mask signals to mask register 260. The mask signals identify the bytes on streaming data bus 211 that are valid. Alignment block 258 arranges the incoming data into its proper position in a 16 byte aligned data word. Alignment block 258 is coupled to buffer 256 to provide the properly aligned data.

[0220] Buffer 256 maintains the resulting 16 byte data words until they are written into memory over a data line output of buffer 256, which is coupled to data cache 52 via arbiter 176. Storage engine 254 addresses memory in MPU 10 and provides data storage opcodes over its address and memory opcode outputs. The address and opcode outputs of storage engine 254 are coupled to data cache 52 via arbiter 176. In one embodiment, storage engine 254 issues 16 byte aligned data storage operations.

[0221] Streaming output engine 162 includes configuration registers 250 and 252. Registers 250 and 252 are coupled to receive data from sequencer 150 on data signals in SO control interface 212. Register 250 is coupled to a start address strobe provided by sequencer 150 on SO control interface 212. Register 250 latches the start address data presented on interface 212 when sequencer 150 asserts the start address strobe. Register 252 is coupled to a mode strobe provided by sequencer 150 on SO control interface 212. Register 252 latches the mode data presented on interface 212 when sequencer 150 asserts the mode strobe.

[0222] In one embodiment, mode configuration register 252 contains 2 bits. A first bit controls a cache line burst mode. When this bit is asserted, streaming output engine 162 waits for a full cache line word to accumulate in engine 162 before storing data to memory. When the first bit is not asserted, streaming output engine 162 waits for at least 16 bytes to accumulate in engine 162 before storing data to memory.

[0223] The second bit controls assertion of the store-create instruction by coprocessor 62. If the store-create mode bit is not asserted, then coprocessor 62 does not assert the store-create opcode. If the store-create bit is asserted, storage engine 254 issues the store-create opcode under the following conditions: 1) If cache line burst mode is enabled, streaming output engine 162 is storing the first 16 bytes of a cache line, and engine 162 has data for the entire cache line; and 2) If cache line burst mode is not enabled, streaming output engine 162 is storing the first 16 bytes of a cache line, and engine 162 has 16 bytes of data for the cache line.
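
The two store-create conditions reduce to a single predicate. The sketch below is illustrative only: it assumes 64 byte cache lines (the line size mentioned for the streaming input engine's pre-fetches) and 16 byte stores, and the function name and parameters are hypothetical.

```python
CACHE_LINE_BYTES = 64   # assumption: same 64 byte line size used by the pre-fetch description
STORE_BYTES = 16        # storage engine 254 issues 16 byte aligned stores

def use_store_create(store_create_bit, burst_mode, store_address, buffered_bytes):
    """Decide whether a store should carry the store-create opcode."""
    if not store_create_bit:
        return False
    if store_address % CACHE_LINE_BYTES != 0:
        return False                          # not the first 16 bytes of a cache line
    needed = CACHE_LINE_BYTES if burst_mode else STORE_BYTES
    return buffered_bytes >= needed

burst_full_line = use_store_create(True, True, 0x1000, 64)
burst_partial_line = use_store_create(True, True, 0x1000, 48)
non_burst_first_word = use_store_create(True, False, 0x1000, 16)
mid_line_store = use_store_create(True, False, 0x1010, 16)
```

In both modes the opcode is considered only for a store that begins a cache line; the modes differ in how much data must already be buffered for that line.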

[0224] SO control interface 212 includes the following additional signals: 1) Done: asserted by sequencer 150 to instruct streaming output engine 162 that no more data is being provided on data bus 211; 2) Abort: provided by sequencer 150 to instruct streaming output engine 162 to flush buffer 256 and cease issuing store opcodes; 3) Busy: supplied by streaming output engine 162 to indicate there is data in buffer 256 to be transferred to memory; 4) Align Opcode & Advance: supplied by sequencer 150 to identify the number of bytes transferred in a single data transfer on data bus 211. The align opcode can identify 4, 8, or 16 byte transfers in one embodiment. When the advance bit is deasserted, the align opcode is ignored by streaming output engine 162; and 5) Stall: supplied by streaming output engine 162 to indicate buffer 256 is full. In response to receiving the Stall signal, sequencer 150 stalls data transfers to engine 162.

[0225] Alignment block 258 aligns incoming data from streaming data bus 211 in response to the alignment opcode and start address register value. FIG. 14 shows internal circuitry for buffer 256 and alignment block 258 in one embodiment of the invention. Buffer 256 supplies a 16 byte aligned word from register 262 to memory on the output data line formed by the outputs of register 262. Buffer 256 internally maintains 4 buffers, each storing 4 byte data words received from alignment block 258. Data buffer 270 is coupled to output word register 262 to provide the least significant 4 bytes (0-3). Data buffer 268 is coupled to output word register 262 to provide bytes 4-7. Data buffer 266 is coupled to output word register 262 to provide bytes 8-11. Data buffer 264 is coupled to output word register 262 to provide the most significant bytes (12-15).

[0226] Alignment block 258 includes multiplexers 272, 274, 276, and 278 to route data from streaming data bus 211 to buffers 264, 266, 268, and 270. Data outputs from multiplexers 272, 274, 276, and 278 are coupled to provide data to the inputs of buffers 264, 266, 268, and 270, respectively. Each multiplexer includes four data inputs. Each input is coupled to a different 4 byte segment of streaming data bus 211. A first multiplexer data input receives bytes 0-3 of data bus 211. A second multiplexer data input receives bytes 4-7 of data bus 211. A third multiplexer data input receives bytes 8-11 of data bus 211. A fourth multiplexer data input receives bytes 12-15 of data bus 211.

[0227] Each multiplexer also includes a set of select signals, which are driven by select logic 280. Select logic 280 sets the select signals for multiplexers 272, 274, 276, and 278, based on the start address in register 250 and the Align Opcode & Advance signal. Select logic 280 ensures that data from streaming data bus 211 is properly aligned in output word register 262.

[0228] For example, the start address may start at byte 4, and the Align Opcode calls for 4 byte transfers on streaming data bus 211. The first 12 bytes of data received from streaming data bus 211 must appear in bytes 4-15 of output register 262.

[0229] When alignment block 258 receives the first 4 byte transfer on bytes 0-3 of bus 211, select logic 280 enables multiplexer 276 to pass these bytes to buffer 268. When alignment block 258 receives the second 4 byte transfer, also appearing on bytes 0-3 of bus 211, select logic 280 enables multiplexer 274 to pass bytes 0-3 to buffer 266. When alignment block 258 receives the third 4 byte transfer, also appearing on bytes 0-3 of bus 211, select logic 280 enables multiplexer 272 to pass bytes 0-3 to buffer 264. As a result, when buffer 256 performs its 16 byte aligned store to memory, the twelve bytes received from data bus 211 appear in bytes 4-15 of the stored word.

[0230] In another example, the start address starts at byte 12, and the Align Opcode calls for 8 byte transfers on streaming data bus 211. Alignment block 258 receives the first 8 byte transfer on bytes 0-7 of bus 211. Select logic 280 enables multiplexer 272 to pass bytes 0-3 of bus 211 to buffer 264 and enables multiplexer 278 to pass bytes 4-7 of bus 211 to buffer 270. Alignment block 258 receives the second 8 byte transfer on bytes 0-7 of bus 211. Select logic 280 enables multiplexer 276 to pass bytes 0-3 of bus 211 to buffer 268 and enables multiplexer 274 to pass bytes 4-7 of bus 211 to buffer 266. Register 262 transfers the newly recorded 16 bytes to memory in 2 transfers. The first transfer presents the least significant 4 bytes of the newly received 16 bytes in bytes 12-15. The second transfer presents the remaining 12 bytes of the newly received data on bytes 0-11.
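
The routing decisions in the two examples above follow one rule: each 4 byte segment of a transfer goes to the next 4 byte lane of the output word, wrapping from lane 3 back to lane 0. The sketch below models that rule in software. It is an illustration under stated assumptions, not the logic of select logic 280: the function name is hypothetical, and start addresses are assumed to be 4 byte aligned, as in both examples.

```python
# Lane k holds bytes 4k..4k+3 of output word register 262 (numbering from FIG. 14).
LANE_TO_MUX = {0: 278, 1: 276, 2: 274, 3: 272}
LANE_TO_BUFFER = {0: 270, 1: 268, 2: 266, 3: 264}

def route_transfers(start_address, transfer_size, count):
    """For each transfer, list (first bus byte, multiplexer, buffer) per 4 byte segment."""
    if transfer_size not in (4, 8, 16):
        raise ValueError("align opcode identifies 4, 8, or 16 byte transfers")
    if start_address % 4 != 0:
        raise ValueError("sketch assumes a 4 byte aligned start address")
    offset = start_address % 16            # position within the 16 byte aligned word
    routes = []
    for _ in range(count):
        segments = []
        for seg in range(transfer_size // 4):
            lane = (offset // 4 + seg) % 4  # wrap into the next 16 byte word
            segments.append((seg * 4, LANE_TO_MUX[lane], LANE_TO_BUFFER[lane]))
        routes.append(segments)
        offset = (offset + transfer_size) % 16
    return routes

first_example = route_transfers(start_address=4, transfer_size=4, count=3)
second_example = route_transfers(start_address=12, transfer_size=8, count=2)
```

Here `first_example` selects multiplexers 276, 274, and 272 in turn, and `second_example` selects 272 and 278 for the first transfer and 276 and 274 for the second, matching the two walkthroughs.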

[0231] One of ordinary skill will recognize that FIG. 14 only shows one possible embodiment of buffer 256 and alignment block 258. Other embodiments are possible using well known circuitry to achieve the above-described functionality.

[0232] 5. RxMAC and Packet Reception

[0233] a. RxMAC

[0234] FIG. 15 illustrates one embodiment of RxMAC 170 in accordance with the present invention. RxMAC 170 receives data from a network and forwards it to streaming output engine 162 for storing in MPU 10 memory. The combination of RxMAC 170 and streaming output engine 162 enables MPU 10 to directly write network data to cache memory, without the data first being stored in main memory 26.

[0235] RxMAC 170 includes media access controller (“MAC”) 290, buffer 291, and sequencer interface 292. In operation, MAC 290 is coupled to a communications medium through a physical layer device (not shown) to receive network data, such as data packets. MAC 290 performs the media access controller operations required by the network protocol governing data transfers on the coupled communications medium. Examples of MAC operations include: 1) framing incoming data packets; 2) filtering incoming packets based on destination addresses; 3) evaluating Frame Check Sequence (“FCS”) checksums; and 4) detecting packet reception errors.

[0236] In one embodiment, MAC 290 conforms to the IEEE 802.3 Standard for a communications network supporting GMII Gigabit Ethernet. In one such embodiment, the MAC 290 network interface includes the following signals from the IEEE 802.3z Standard: 1) RXD: an input to MAC 290 providing 8 bits of received data; 2) RX_DV: an input to MAC 290 indicating RXD is valid; 3) RX_ER: an input to MAC 290 indicating an error in RXD; and 4) RX_CLK: an input to MAC 290 providing a 125 MHz clock for timing reference for RXD.

[0237] One of ordinary skill will recognize that in alternate embodiments of the present invention MAC 290 includes interfaces to physical layer devices conforming to different network standards. One such standard is the IEEE 802.3 standard for MII 100 megabit per second Ethernet.

[0238] In one embodiment of the invention, RxMAC 170 also receives and frames data packets from a point-to-point link with a device that couples MPUs together. One such device is described in U.S. patent application Ser. No. ______, entitled Cross-Bar Switch, filed on Jul. 6, 2001, having Attorney Docket No. NEXSI-01022US0. In one such embodiment, the point-to-point link includes signaling that conforms to the IEEE 802.3 Standard for GMII Gigabit Ethernet MAC interface operation.

[0239] MAC 290 is coupled to buffer 291 to provide framed words (MAC Data) from received data packets. In one embodiment, each word contains 8 bits, while in other embodiments alternate size words can be employed. Buffer 291 stores a predetermined number of framed words, then transfers the words to streaming data bus 211. Streaming output engine 162 stores the transferred data in memory, as will be described below in greater detail. In one such embodiment, buffer 291 is a first-in-first-out (“FIFO”) buffer.

[0240] As listed above, MAC 290 monitors incoming data packets for errors. In one embodiment, MAC 290 provides indications of whether the following occurred for each packet: 1) FCS error; 2) address mismatch; 3) size violation; 4) overflow of buffer 291; and 5) RX_ER signal asserted. In one such embodiment, this information is stored in memory in MPU 10, along with the associated data packet.

[0241] RxMAC 170 communicates with sequencer 150 through sequencer interface 292. Sequencer interface 292 is coupled to receive data on sequencer output data bus 200 and provide data on sequencer input data bus 202. Sequencer interface 292 is coupled to receive a signal from enable interface 204 to inform RxMAC 170 whether it is activated.

[0242] Sequencer 150 programs RxMAC 170 for operation through control registers (not shown) in sequencer interface 292. Sequencer 150 also retrieves control information about RxMAC 170 by querying registers in sequencer interface 292. Sequencer interface 292 is coupled to MAC 290 and buffer 291 to provide and collect control register information.

[0243] Control registers in sequencer interface 292 are coupled to sequencer input data bus 202 and output data bus 200. The registers are also coupled to sequencer control bus 206 to provide for addressing and controlling register store and load operations. Sequencer 150 writes one of the control registers to define the mode of operation for RxMAC 170. In one mode, RxMAC 170 is programmed for connection to a communications network and in another mode RxMAC 170 is programmed for the above-described point-to-point link to another device. Sequencer 150 employs another set of control registers to indicate the destination addresses for packets that RxMAC 170 is to accept. Sequencer interface 292 provides the following signals in control registers that are accessed by sequencer 150: 1) End of Packet: indicating the last word for a packet has left buffer 291; 2) Bundle Ready: indicating buffer 291 has accumulated a predetermined number of bytes for transfer on streaming data bus 211; 3) Abort: indicating an error condition has been detected, such as an address mismatch, FCS error, or buffer overflow; and 4) Interrupt: indicating sequencer 150 should execute an interrupt service routine, typically for responding to MAC 290 losing link to the communications medium. Sequencer interface 292 is coupled to MAC 290 and buffer 291 to receive the information necessary for controlling the above-described signals.

[0244] Sequencer 150 receives the above-identified signals in response to control register reads that access control registers containing the signals. In one embodiment, a single one bit register provides all the control signals in response to a series of register reads by sequencer 150. In an alternate embodiment, the control signals are provided on control interface 206. Sequencer 150 responds to the control signals by executing operations that correspond to the signals, as will be described in greater detail below. In one embodiment, sequencer 150 executes corresponding micro-code routines in response to the signals. Once sequencer 150 receives and responds to one of the above-described signals, sequencer 150 performs a write operation to a control register in sequencer interface 292 to deassert the signal.
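The read, respond, then write-to-deassert handshake described above can be sketched in Python. This is a minimal illustration, not the disclosed register layout. The bit positions in `RxStatus`, the `StatusRegister` class, and its method names are hypothetical and do not appear in the specification.

```python
from enum import IntFlag


class RxStatus(IntFlag):
    # Bit assignments are assumptions; the text does not specify them.
    END_OF_PACKET = 1
    BUNDLE_READY = 2
    ABORT = 4
    INTERRUPT = 8


class StatusRegister:
    """Hypothetical model of the signal bits held in sequencer interface 292."""

    def __init__(self):
        self.value = RxStatus(0)

    def assert_signal(self, signal):
        # Set by the MAC or buffer side when a condition occurs.
        self.value |= signal

    def read(self):
        # Control register read performed by the sequencer.
        return self.value

    def clear(self, signal):
        # Control register write performed by the sequencer to deassert a signal.
        self.value &= ~signal


reg = StatusRegister()
reg.assert_signal(RxStatus.BUNDLE_READY | RxStatus.ABORT)
saw_bundle_ready = RxStatus.BUNDLE_READY in reg.read()
reg.clear(RxStatus.BUNDLE_READY)   # sequencer has responded to Bundle Ready
```

After the clear, only the Abort bit remains set, so a later read does not report Bundle Ready again until the buffer reasserts it.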

[0245] b. Packet Reception

[0246] FIG. 16 illustrates a process for receiving data packets using coprocessor 62 in one embodiment of the present invention. CPU 60 initializes sequencer 152 for managing packet receptions (step 300). CPU 60 provides sequencer 150 with addresses in MPU memory for coprocessor 62 to store data packets. One data storage scheme for use with the present invention appears in detail below.

[0247] After being initialized by CPU 60, sequencer 152 initializes RxMAC 170 (step 301) and streaming output engine 172 (step 302). CPU 60 provides RxMAC 170 with an operating mode for MAC 290 and the destination addresses for data packets to be received. CPU 60 provides streaming output engine 172 with a start address and operating modes. The starting address is the memory location where streaming output engine 172 begins storing the next incoming packet. In one embodiment, sequencer 152 sets the operating modes as follows: 1) the cache line burst mode bit is not asserted; and 2) the store-create mode bit is asserted. As described above, initializing streaming output engine 172 causes it to begin memory store operations.

[0248] Once initialization is complete, sequencer 152 determines whether data needs to be transferred out of RxMAC 170 (step 304). Sequencer 152 monitors the bundle ready signal to make this determination. Once RxMAC 170 asserts bundle ready, bytes from buffer 291 in RxMAC 170 are transferred to streaming output engine 172 (step 306).

[0249] Upon detecting the bundle ready signal (step 304), sequencer 152 issues a store opcode to streaming output engine 172. Streaming output engine 172 responds by collecting bytes from buffer 291 on streaming data bus 211 (step 306). In one embodiment, buffer 291 places 8 bytes of data on the upper 8 bytes of streaming data bus 211, and the opcode causes engine 172 to accept these bytes. Streaming output engine 172 operates as described above to transfer the packet data to cache memory 52 (step 306).

[0250] Sequencer 152 also resets the bundle ready signal (step 308), so the signal can be employed again once buffer 291 accumulates a sufficient number of bytes. Sequencer 152 clears the bundle ready signal by performing a store operation to a control register in sequencer interface 292 in RxMAC 170.

[0251] Next, sequencer 152 determines whether bytes remain to be transferred out of RxMAC 170 (step 310). Sequencer 152 makes this determination by monitoring the end of packet signal from RxMAC 170. If RxMAC 170 has not asserted the end of packet signal, sequencer 152 begins monitoring the bundle ready signal again (step 304). If RxMAC 170 has asserted the end of packet signal (step 310), sequencer 152 issues the done signal to streaming output engine 172 (step 314).

[0252] Once the done signal is issued, sequencer 152 examines the abort signal in RxMAC 170 (step 309). If the abort signal is asserted, sequencer 152 performs an abort operation (step 313). After performing the abort operation, sequencer 152 examines the interrupt signal in RxMAC 170 (step 314). If the interrupt signal is set, sequencer 152 executes a responsive interrupt service routine (“ISR”) (step 317). After the ISR or if the interrupt is not set, sequencer 152 returns to initialize the streaming output engine for another reception (step 302).

[0253] If the abort signal was not set (step 309), sequencer 152 waits for streaming output engine 172 to deassert the busy signal (step 316). After sensing the busy signal is deasserted, sequencer 152 examines the interrupt signal in RxMAC 170 (step 311). If the interrupt is asserted, sequencer 152 performs a responsive ISR (step 315). After the responsive ISR or if the interrupt was not asserted, sequencer 152 performs a descriptor operation (step 318). As part of the descriptor operation, sequencer 152 retrieves status information from sequencer interface 292 in RxMAC 170 and writes the status to a descriptor field corresponding to the received packet, as will be described below. Sequencer 152 also determines the address for the next receive packet and writes this value in a next address descriptor field. Once the descriptor operation is complete, sequencer 152 initializes streaming output engine 172 (step 302) as described above. This enables MPU 10 to receive another packet into memory.
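The control flow of the reception process can be summarized as a small software model. The Python sketch below is a simplified illustration under stated assumptions and is not the disclosed micro-code. `FakeRxMac` and `receive_packet` are hypothetical names, stores complete synchronously so the busy-signal wait is omitted, and the model requires at least one queued bundle.

```python
class FakeRxMac:
    """Hypothetical stand-in for the RxMAC 170 control signals."""

    def __init__(self, bundles, abort=False, interrupt=False):
        if not bundles:
            raise ValueError("model needs at least one bundle")
        self.pending = list(bundles)    # bundles waiting in the buffer
        self.bundle_ready = True
        self.end_of_packet = False
        self.abort = abort
        self.interrupt = interrupt

    def pop_bundle(self):
        bundle = self.pending.pop(0)
        self.end_of_packet = not self.pending   # last bundle has left the buffer
        return bundle

    def clear_bundle_ready(self):
        # In this model the flag reasserts at once while data is still queued.
        self.bundle_ready = bool(self.pending)


def receive_packet(rx, memory):
    """Run one reception and return the list of sequencer actions taken."""
    actions = []
    while not rx.end_of_packet:
        assert rx.bundle_ready             # sequencer sees bundle ready
        memory.extend(rx.pop_bundle())     # store opcode moves the bundle to memory
        rx.clear_bundle_ready()            # sequencer resets bundle ready
    actions.append("done")                 # done signal to the streaming output engine
    if rx.abort:
        actions.append("abort")
        if rx.interrupt:
            actions.append("isr")
        return actions
    # Waiting for the busy signal is omitted: stores are synchronous in this model.
    if rx.interrupt:
        actions.append("isr")
    actions.append("descriptor")           # write status and next packet pointer
    return actions


mem = bytearray()
normal = receive_packet(FakeRxMac([b"abcdefgh", b"ij"]), mem)
aborted = receive_packet(FakeRxMac([b"x"], abort=True, interrupt=True), bytearray())
```

In the normal case the model performs the done and descriptor actions; in the abort case it performs done, abort, and the interrupt service routine, and skips the descriptor operation.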

[0254] FIG. 17 provides a logical representation of one data management scheme for use in embodiments of the present invention. During sequencer initialization (step 300), the data structure shown in FIG. 17 is established. The data structure includes entries 360, 362, 364, and 366, which are mapped into MPU 10 memory. Each entry includes N blocks of bytes. Sequencer 152 maintains corresponding ownership registers 368, 370, 372, and 374 for identifying ownership of entries 360, 362, 364, and 366, respectively.

[0255] In one embodiment, each entry includes 32 blocks, and each block includes 512 bytes. In one such embodiment, blocks 0 through N−1 are contiguous in memory and entries 360, 362, 364, and 366 are contiguous in memory.
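For the contiguous embodiment above, block addresses follow from simple arithmetic. The Python sketch below works through it; the base address used in the usage lines and the function name `block_address` are hypothetical and serve only to illustrate the layout.

```python
BLOCKS_PER_ENTRY = 32
BLOCK_SIZE = 512          # bytes per block
NUM_ENTRIES = 4           # entries 360, 362, 364, and 366
ENTRY_SIZE = BLOCKS_PER_ENTRY * BLOCK_SIZE


def block_address(base, entry, block):
    """Address of `block` within `entry`, assuming contiguous entries and blocks."""
    if not (0 <= entry < NUM_ENTRIES and 0 <= block < BLOCKS_PER_ENTRY):
        raise IndexError("entry or block out of range")
    return base + entry * ENTRY_SIZE + block * BLOCK_SIZE


base = 0x10000                         # hypothetical start of the data structure
addr = block_address(base, 1, 2)       # third block of the second entry
total = NUM_ENTRIES * ENTRY_SIZE       # bytes spanned by all four entries
```

Each entry spans 16,384 bytes, so the four entries together occupy 65,536 bytes in this embodiment.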

[0256] Streaming output engine 172 stores data received from RxMAC 170 in entries 360, 362, 364, and 366. CPU 60 retrieves the received packets from these entries. As described with reference to FIG. 16, sequencer 152 instructs streaming output engine 172 where to store received data (step 302). Sequencer 152 provides streaming output engine 172 with a start address offset from the beginning of a block in an entry owned by sequencer 152. In one embodiment, the offset includes the following fields: 1) Descriptor: for storing status information regarding the received packet; and 2) Next Packet Pointer: for storing a pointer to the block that holds the next packet. In some instances reserved bytes are included after the Next Packet Pointer.

[0257] As described with reference to FIG. 16, sequencer 152 performs a descriptor operation (step 318) to write the Descriptor and Next Packet Pointer fields. Sequencer 152 identifies the Next Packet Pointer by counting the number of bytes received by RxMAC 170. This is achieved in one embodiment by counting the number of bundle ready signals (step 304) received for a packet. In one embodiment, sequencer 152 ensures that the Next Packet Pointer points to the first memory location in a block. Sequencer 152 retrieves information for the Descriptor field from sequencer interface 292 in RxMAC 170 (FIG. 15).

[0258] In one embodiment, the Descriptor field includes the following: 1) Frame Length: indicating the length of the received packet; 2) Frame Done: indicating the packet has been completed; 3) Broadcast Frame: indicating whether the packet has a broadcast address; 4) Multicast Frame: indicating whether the packet is a multicast packet supported by RxMAC 170; 5) Address Match: indicating whether an address match occurred for the packet; 6) Frame Error: indicating whether the packet had a reception error; and 7) Frame Error Type: indicating the type of frame error, if any. In other embodiments, additional and different status information is included in the Descriptor field.
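The Descriptor fields listed above can be pictured as a simple record. The Python sketch below is illustrative only: the text does not give field widths or a bit layout, so none is modeled. The class name `RxDescriptor` and the helper `is_usable`, which applies one plausible acceptance policy, are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class RxDescriptor:
    """Fields named in the text; types are assumptions."""
    frame_length: int          # length of the received packet
    frame_done: bool           # packet has been completed
    broadcast_frame: bool      # packet has a broadcast address
    multicast_frame: bool      # supported multicast packet
    address_match: bool        # an address match occurred
    frame_error: bool          # packet had a reception error
    frame_error_type: int = 0  # type of frame error, if any


def is_usable(descriptor):
    # Hypothetical policy: accept completed, matched, error-free frames.
    return (descriptor.frame_done
            and descriptor.address_match
            and not descriptor.frame_error)


good = RxDescriptor(64, True, False, False, True, False)
bad = RxDescriptor(64, True, False, False, True, True, frame_error_type=1)
```

Software reading a received packet could inspect such a record before processing the packet data that follows it in the block.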

[0259] Streaming output engine 172 stores incoming packet data into as many contiguous blocks as necessary. If the entry being used runs out of blocks, streaming output engine 172 buffers data into the first block of the next entry, provided sequencer 152 owns the entry. One exception to this operation is that streaming output engine 172 will not split a packet between entry 366 and 360.

[0260] In one embodiment, 256 bytes immediately following a packet are left unused. In this embodiment, sequencer 152 skips a block in assigning the next start address (step 318 and step 302) if the last block of a packet has less than 256 bytes unused.
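The block-skipping rule can be expressed as a short calculation. The following Python sketch assumes 512 byte blocks, a next packet that always begins at the start of a block, and a caller-supplied header size for the Descriptor and Next Packet Pointer fields, since the text does not state their sizes. The function name `next_packet_block` is hypothetical, and wrapping between entries is not modeled.

```python
BLOCK_SIZE = 512   # bytes per block
GAP = 256          # bytes left unused after a packet


def next_packet_block(start_block, header_bytes, packet_bytes):
    """Return the index of the block where the next packet should start."""
    used = header_bytes + packet_bytes
    blocks_used = max(1, -(-used // BLOCK_SIZE))       # ceiling division
    unused_in_last = blocks_used * BLOCK_SIZE - used
    nxt = start_block + blocks_used
    if unused_in_last < GAP:
        nxt += 1                                       # skip a block to preserve the gap
    return nxt


roomy = next_packet_block(0, 16, 100)    # 396 bytes unused in the last block
tight = next_packet_block(0, 16, 300)    # 196 bytes unused in the last block
full = next_packet_block(3, 16, 1008)    # last block completely filled
```

Only the second and third cases leave fewer than 256 unused bytes in the last block, so only they skip a block.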

[0261] After initialization (step 300), sequencer 152 possesses ownership of entries 360, 362, 364, and 366. After streaming output engine 172 fills an entry, sequencer 152 changes the value in the entry's corresponding ownership register to pass ownership of the entry to CPU 60. Once CPU 60 retrieves the data in an entry, CPU 60 writes the entry's corresponding ownership register to transfer entry ownership to sequencer 152. After entry 366 is filled, sequencer 152 waits for ownership of entry 360 to be returned before storing any more packets.
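The ownership handoff behaves like a four-slot ring shared by a producer and a consumer. The Python sketch below is a simplified model under that reading. The class `EntryRing` and its method names are hypothetical, and each list element stands in for one of ownership registers 368, 370, 372, and 374.

```python
SEQUENCER, CPU = "sequencer", "cpu"


class EntryRing:
    """Hypothetical model of the ownership registers for the four entries."""

    def __init__(self, n=4):
        self.owner = [SEQUENCER] * n   # sequencer owns every entry after initialization
        self.fill = 0                  # index of the entry currently being filled

    def can_store(self):
        # Packets may be stored only into an entry the sequencer owns.
        return self.owner[self.fill] == SEQUENCER

    def entry_filled(self):
        # Sequencer passes the filled entry to the CPU and advances to the next one.
        self.owner[self.fill] = CPU
        self.fill = (self.fill + 1) % len(self.owner)
        return self.can_store()

    def cpu_release(self, index):
        # CPU has retrieved the data and returns the entry to the sequencer.
        if self.owner[index] != CPU:
            raise ValueError("entry is not owned by the CPU")
        self.owner[index] = SEQUENCER


ring = EntryRing()
progress = [ring.entry_filled() for _ in range(4)]   # fill all four entries
stalled = not ring.can_store()                       # first entry not yet returned
ring.cpu_release(0)
resumed = ring.can_store()
```

After the fourth entry is filled the model stalls until the CPU returns the first entry, which mirrors the wait described above.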

[0262] 6. TxMAC and Packet Transmission

[0263] a. TxMAC

[0264] FIG. 18 illustrates one embodiment of TxMAC 160 in accordance with the present invention. TxMAC 160 transfers data from MPU 10 to a network interface for transmission onto a communications medium. TxMAC 160 operates in conjunction with streaming input engine 154 to directly transfer data from cache memory to a network interface, without the data first being stored in main memory 26.

[0265] TxMAC 160 includes media access controller (“MAC”) 320, buffer 322, and sequencer interface 324. In operation, MAC 320 is coupled to a communications medium through a physical layer device (not shown) to transmit network data, such as data packets. As with MAC 290, MAC 320 performs the media access controller operations required by the network protocol governing data transfers on the coupled communications medium. Examples of MAC transmit operations include: 1) serializing outgoing data packets; 2) applying FCS checksums; and 3) detecting packet transmission errors.

[0266] In one embodiment, MAC 320 conforms to the IEEE 802.3 Standard for a communications network supporting GMII Gigabit Ethernet. In one such embodiment, the MAC 320 network interface includes the following signals from the IEEE 802.3z Standard: 1) TXD: an output from MAC 320 providing 8 bits of transmit data; 2) TX_EN: an output from MAC 320 indicating TXD has valid data; 3) TX_ER: an output of MAC 320 indicating a coding violation on data received by MAC 320; 4) COL: an input to MAC 320 indicating there has been a collision on the coupled communications medium; 5) GTX_CLK: an output from MAC 320 providing a 125 MHz clock timing reference for TXD; and 6) TX_CLK: an output from MAC 320 providing a timing reference for TXD when the communications network operates at 10 megabits per second or 100 megabits per second.

[0267] One of ordinary skill will recognize that in alternate embodiments of the present invention MAC 320 includes interfaces to physical layer devices conforming to different network standards. In one such embodiment, MAC 320 implements a network interface for the IEEE 802.3 standard for MII 100 megabit per second Ethernet.

[0268] In one embodiment of the invention, TxMAC 160 also transmits data packets to a point-to-point link with a device that couples MPUs together, such as the device described in U.S. patent application Ser. No. ______, entitled Cross-Bar Switch, filed on Jul. 6, 2001, having Attorney Docket No. NEXSI-01022US0. In one such embodiment, the point-to-point link includes signaling that conforms to the GMII MAC interface specification.

[0269] MAC 320 is coupled to buffer 322 to receive framed words for data packets. In one embodiment, each word contains 8 bits, while in other embodiments alternate size words are employed. Buffer 322 receives data words from streaming data bus 210. Streaming input engine 154 retrieves the packet data from memory, as will be described below in greater detail. In one such embodiment, buffer 322 is a first-in-first-out (“FIFO”) buffer.

[0270] As explained above, MAC 320 monitors outgoing data packet transmissions for errors. In one embodiment, MAC 320 provides indications of whether the following occurred for each packet: 1) collisions; 2) excessive collisions; and 3) underflow of buffer 322.

[0271] TxMAC 160 communicates with sequencer 150 through sequencer interface 324. Sequencer interface 324 is coupled to receive data on sequencer output bus 200 and provide data on sequencer input bus 202. Sequencer interface 324 is coupled to receive a signal from enable interface 204 to inform TxMAC 160 whether it is activated.

[0272] Sequencer 150 programs TxMAC 160 for operation through control registers (not shown) in sequencer interface 324. Sequencer 150 also retrieves control information about TxMAC 160 by querying these same registers. Sequencer interface 324 is coupled to MAC 320 and buffer 322 to provide and collect control register information.

[0273] The control registers in sequencer interface 324 are coupled to input data bus 202 and output data bus 200. The registers are also coupled to control interface 206 to provide for addressing and controlling register store and load operations. Sequencer 150 writes one of the control registers to define the mode of operation for TxMAC 160. In one mode, TxMAC 160 is programmed for connection to a communications network and in another mode TxMAC 160 is programmed for the above-described point-to-point link to another device. Sequencer 150 employs a register in TxMAC's set of control registers to indicate the number of bytes in the packet TxMAC 160 is sending.

[0274] Sequencer interface 324 provides the following signals to sequencer control interface 206: 1) Retry: indicating a packet was not properly transmitted and will need to be resent; 2) Packet Done: indicating the packet being transmitted has left MAC 320; and 3) Back-off: indicating a device connecting MPUs in the above-described point-to-point mode cannot receive a data packet at this time and the packet should be transmitted later.

[0275] Sequencer 150 receives the above-identified signals and responds by executing operations that correspond to the signals, as will be described in greater detail below. In one embodiment, sequencer 150 executes corresponding micro-code routines in response to the signals. Once sequencer 150 receives and responds to one of the above-described signals, sequencer 150 performs a write operation to a control register in sequencer interface 324 to deassert the signal.

[0276] Sequencer interface 324 receives an Abort signal from sequencer control interface 206. The Abort signal indicates that excessive retries have been made in transmitting a data packet and that no further attempts should be made to transmit the packet. Sequencer interface 324 is coupled to MAC 320 and buffer 322 to receive information necessary for controlling the above-described signals and forwarding instructions from sequencer 150.

[0277] In one embodiment, sequencer interface 324 also provides the 9 Byte Size Advance signal to streaming input engine 154.

[0278] b. Packet Transmission

[0279] FIG. 19 illustrates a process MPU 10 employs in one embodiment of the present invention to transmit packets. At the outset, CPU 60 initializes sequencer 150 (step 330). CPU 60 instructs sequencer 150 to transmit a packet and provides sequencer 150 with the packet's size and address in memory. Next, sequencer 150 initializes TxMAC 160 (step 332) and streaming input engine 154 (step 334).

[0280] Sequencer 150 writes to control registers in sequencer interface 324 to set the mode of operation and size for the packet to be transmitted. Sequencer 150 provides the memory start address, data size, and mode bits to streaming input engine 154. Sequencer 150 also issues the Start signal to streaming input engine 154 (step 336), which results in streaming input engine 154 beginning to fetch packet data from data cache 52.

[0281] Sequencer 150 and streaming input engine 154 combine to transfer packet data to TxMAC 160 (step 338). TxMAC 160 supplies the 9 Byte Size Signal to transfer data one byte at a time from streaming input engine 154 to buffer 322 over streaming data bus 210. Upon receiving these bytes, buffer 322 begins forwarding the bytes to MAC 320, which serializes the bytes and transmits them to a network interface (step 340). As part of the transmission process, TxMAC 160 decrements the packet count provided by sequencer 150 when a byte is transferred to buffer 322 from streaming input engine 154. In an alternate embodiment, sequencer 150 provides the 9 Byte Size Signal.

[0282] During the transmission process, MAC 320 ensures that MAC level operations are performed in accordance with appropriate network protocols, including collision handling. If a collision does occur, TxMAC 160 asserts the Retry signal and the transmission process restarts with the initialization of TxMAC 160 (step 332) and streaming input engine 154 (step 334).

[0283] While TxMAC 160 is transmitting, sequencer 150 waits for TxMAC 160 to complete transmission (step 342). In one embodiment, sequencer 150 monitors the Packet Done signal from TxMAC 160 to determine when transmission is complete. Sequencer 150 can perform this monitoring by polling the Packet Done signal or coupling it to an interrupt input.

[0284] Once Packet Done is asserted, sequencer 150 invalidates the memory location where the packet data was stored (step 346). This alleviates the need for MPU 10 to update main memory when reassigning the cache location that stored the transmitted packet. In one embodiment, sequencer 150 invalidates the cache location by issuing a line invalidation instruction to data cache 52.

[0285] After invalidating the transmit packet's memory location, sequencer 150 can transmit another packet. Sequencer 150 initializes TxMAC 160 (step 332) and streaming input engine 154 (step 334) and the above-described transmission process is repeated.
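The transmit flow, including the Retry and Abort behavior described earlier, can be outlined as a loop. The Python sketch below is a simplified model, not the disclosed micro-code. The callback `mac_send`, the function `transmit`, and the `max_retries` limit are hypothetical, since the text does not state how many retries count as excessive.

```python
def transmit(packet, mac_send, max_retries=3):
    """Attempt to send `packet`; return (outcome, attempts).

    `mac_send(packet)` is a hypothetical callback that models TxMAC 160 and
    returns "done" (Packet Done) or "retry" (Retry, for example after a collision).
    """
    attempts = 0
    while True:
        attempts += 1
        # Each attempt stands for: initialize TxMAC, initialize the streaming
        # input engine, issue Start, and transfer the packet data.
        status = mac_send(packet)
        if status == "done":
            # The cache location holding the packet would be invalidated here.
            return ("sent", attempts)
        if attempts > max_retries:
            return ("aborted", attempts)   # excessive retries: stop trying


outcomes = iter(["retry", "retry", "done"])
sent = transmit(b"payload", lambda p: next(outcomes))
gave_up = transmit(b"payload", lambda p: "retry", max_retries=2)
```

The first call succeeds on its third attempt, while the second call aborts once the assumed retry limit is exceeded.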

[0286] In one embodiment of the invention, the transmit process employs a bandwidth allocation procedure for enhancing quality of service. Bandwidth allocation allows packets to be assigned priority levels having a corresponding amount of allocated bandwidth. In one such embodiment, when a class exhausts its allocated bandwidth no further transmissions may be made from that class until all classes exhaust their bandwidth, unless the exhausted class is the only class with packets awaiting transmission.

[0287] Implementing such an embodiment can be achieved by making the following additions to the process described in FIG. 19, as shown in FIG. 20. When CPU 60 initializes sequencer 150 (step 330), CPU 60 assigns the packet to a bandwidth class. Sequencer 150 determines whether there is bandwidth available to transmit a packet with the assigned class (step 331). If not, sequencer 150 informs CPU 60 to select a packet from another class because the packet's bandwidth class is oversubscribed. The packet with the oversubscribed bandwidth class is selected at a later time (step 350). If bandwidth is available for the assigned class, sequencer 150 continues the transmission process described for FIG. 19 by initializing TxMAC 160 and streaming input engine 154. After transmission is complete, sequencer 150 decrements an available bandwidth allocation counter for the transmitted packet's class (step 345).

[0288] In one embodiment, MPU 10 employs 4 bandwidth classes, having initial bandwidth allocation counts of 128, 64, 32, and 16. Each count is decremented by the number of 16 byte segments in a transmitted packet from the class (step 345). When a count reaches or falls below zero, no further packets with the corresponding class are transmitted unless no other class with a positive count is attempting to transmit a packet. Once all the counts reach zero or all classes attempting to transmit reach zero, sequencer 150 resets the bandwidth allocation counts to their initial count values.
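The counter scheme in this embodiment can be modeled in a few lines. The Python sketch below is one possible reading of the rules above, not the disclosed implementation. The class `BandwidthAllocator`, its method names, and the choice to round a partial 16 byte segment up to a whole segment are assumptions.

```python
INITIAL_COUNTS = (128, 64, 32, 16)   # initial allocation for the 4 classes
SEGMENT = 16                         # bytes per segment


class BandwidthAllocator:
    def __init__(self):
        self.counts = list(INITIAL_COUNTS)

    def may_transmit(self, cls, waiting):
        """`waiting` is the set of classes with packets awaiting transmission."""
        if self.counts[cls] > 0:
            return True
        # An exhausted class may send only if no other waiting class has credit.
        return not any(self.counts[c] > 0 for c in waiting if c != cls)

    def record(self, cls, packet_bytes, waiting):
        """Charge a transmitted packet to `cls`, then reset if all are exhausted."""
        segments = -(-packet_bytes // SEGMENT)      # round up (assumption)
        self.counts[cls] -= segments
        active = set(waiting) | {cls}
        if all(self.counts[c] <= 0 for c in active):
            self.counts = list(INITIAL_COUNTS)


bw = BandwidthAllocator()
bw.record(3, 256, {0, 3})                    # class 3 spends all 16 segments
blocked = not bw.may_transmit(3, {0, 3})     # class 0 still has credit
alone = bw.may_transmit(3, {3})              # class 3 is the only class waiting
bw.record(0, 128 * 16, {0, 3})               # class 0 spends all 128 segments
```

Once both active classes reach zero, the counts return to 128, 64, 32, and 16.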

[0289] E. Connecting Multiple MPU Engines

[0290] In one embodiment of the invention, MPU 10 can be connected to another MPU using TxMAC 160 or RxMAC 170. As described above, in one such embodiment, TxMAC 160 and RxMAC 170 have modes of operation supporting a point-to-point link with a cross-bar switch designed to couple MPUs. One such cross-bar switch is disclosed in the above-identified U.S. patent application Ser. No. ______, entitled Cross-Bar Switch, filed on Jul. 6, 2001, having Attorney Docket No. NEXSI-01022US0. In alternate embodiments, RxMAC 170 and TxMAC 160 support interconnection with other MPUs through bus interfaces and other well known linking schemes. In one point-to-point linking embodiment, the network interfaces of TxMAC 160 and RxMAC 170 are modified to take advantage of the fact that packet collisions do not occur on a point-to-point interface. Signals specified by the applicable network protocol for collision, such as those found in the IEEE 802.3 Specification, are replaced with a hold-off signal.

[0291] In such an embodiment, RxMAC 170 includes a hold-off signal that RxMAC 170 issues to the interconnect device to indicate RxMAC 170 cannot receive more packets. In response, the interconnect device will not transmit any more packets after the current packet, until hold-off is deasserted. Other than this modification, RxMAC 170 operates the same as described above for interfacing to a network.

[0292] Similarly, TxMAC 160 includes a hold-off signal input in one embodiment. When TxMAC 160 receives the hold-off signal from the interconnect device, TxMAC 160 halts packet transmission and issues the Back-off signal to sequencer 150. In response, sequencer 150 attempts to transmit the packet at a later time. Other than this modification, TxMAC 160 operates the same as described above for interfacing to a network.

[0293] The foregoing detailed description has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The described embodiments were chosen in order to best explain the principles of the invention and its practical application to thereby enable others skilled in the art to best utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated. One of ordinary skill in the art will recognize that additional embodiments of the present invention can be made without undue experimentation by combining aspects of the above-described embodiments. It is intended that the scope of the invention be defined by the claims appended hereto.

We claim:
 1. A processing cluster for use in a system with a main memory, said processing cluster comprising: a set of cache memory adapted for coupling to said main memory; and a set of compute engines, wherein each compute engine in said set of compute engines is coupled to said set of cache memory and includes a central processing unit coupled to a coprocessor, wherein said coprocessor is coupled to said set of cache memory.
 2. The processing cluster of claim 1, wherein said coprocessor includes: a sequencer coupled to said central processing unit; and a set of application engines coupled to said sequencer.
 3. The processing cluster of claim 2, wherein said set of application engines includes a streaming input engine coupled to said set of cache memory to retrieve data from said set of cache memory.
 4. The processing cluster of claim 3, wherein said streaming input engine is coupled to at least one application engine in said set of application engines to provide data retrieved from said set of cache memory.
 5. The processing cluster of claim 4, wherein said streaming input engine retrieves a full cache line of data from said set of cache memory and provides said cache line of data to said at least one application engine in transfers including a programmable number of bytes.
 6. The processing cluster of claim 5, wherein said streaming input engine retrieves said full cache line in a plurality of data retrievals from said set of cache memory.
 7. The processing cluster of claim 2, wherein said set of application engines includes a streaming output engine coupled to said set of cache memory to store data in said set of cache memory.
 8. The processing cluster of claim 7, wherein said streaming output engine is coupled to at least one application engine in said set of application engines to store data from said at least one application engine in said set of cache memory.
 9. The processing cluster of claim 8, wherein said streaming output engine retrieves a programmable number of bytes from a full cache line of data from said at least one application engine and transfers a full cache line of data retrieved from said at least one application engine to said set of cache memory.
 10. The processing cluster of claim 9, wherein said streaming output engine transfers said full cache line of data retrieved from said at least one application engine to said set of cache memory using a plurality of transfer operations.
 11. The processing cluster of claim 2, wherein said set of application engines includes: a streaming input engine coupled to said set of cache memory to retrieve data from said set of cache memory; and a transmission media access controller coupled to said streaming input buffer to receive said data and provide said data to a communications network.
 12. The processing cluster of claim 2, wherein said set of application engines includes: a streaming output buffer coupled to said set of cache memory to store data in said set of cache memory; and a reception media access controller coupled to said streaming output buffer to provide network data received from a communications network.
 13. The processing cluster of claim 2, wherein said central processing unit provides instructions to said sequencer identifying an operation to perform.
 14. The processing cluster of claim 13, wherein said sequencer instructs an application engine in said set of application engines to perform said operation in response to the instructions provided by said central processing unit.
 15. The processing cluster of claim 1, further including: a memory management unit coupled to at least 2 compute engines in said set of compute engines.
 16. The processing cluster of claim 15, wherein said memory management unit provides address translations for said at least 2 compute engines on a rotating basis.
 17. The processing cluster of claim 16, wherein said memory management unit is coupled to said coprocessors in said at least 2 compute engines.
 18. The processing cluster of claim 1, wherein said coprocessor includes internal translation buffers for converting physical addresses to virtual addresses.
 19. The processing cluster of claim 1, wherein saidset of cache memory includes a set of first tier caches coupled to asecond tier cache adapted for coupling to said main memory.
 20. Theprocessing cluster of claim 19, wherein each coprocessor is coupled to afirst tier data cache in said set of first tier caches.
 21. Theprocessing cluster of claim 20, wherein each central processing unit iscoupled to a first tier data cache in said set of first tier caches anda first tier instruction cache in said set of first tier caches.
 22. Theprocessing cluster of claim 1, wherein said set of cache memory and saidset of compute engines are formed together on a single integratedcircuit.
 23. The processing cluster of claim 1, wherein said set ofcompute engines consists of 1 compute engine.
 24. The processing clusterof claim 1, wherein said set of compute engines consists of 4 computeengines.
 25. A multi-processor system comprising: a main memory; a setof processing clusters, wherein each processing cluster in said set ofprocessing clusters includes: a set of cache memory coupled to said mainmemory, and a set of compute engines, wherein each compute engine insaid set of compute engines is coupled to said set of cache memory andincludes a central processing unit coupled to a coprocessor, whereinsaid coprocessor is coupled to said set of cache memory; and a snoopcontroller coupled to said sets of cache memory to receive and placememory requests for transferring data between said processing clustersand said main memory.
 26. The multi-processor system of claim 25, wherein said coprocessor includes: a sequencer coupled to said central processing unit; and a set of application engines coupled to said sequencer.
 27. The multi-processor system of claim 26, wherein said set of application engines includes a streaming input engine coupled to said set of cache memory to retrieve data from said set of cache memory.
 28. The multi-processor system of claim 27, wherein said streaming input engine is coupled to at least one application engine in said set of application engines to provide data retrieved from said set of cache memory.
 29. The multi-processor system of claim 26, wherein said set of application engines includes a streaming output engine coupled to said set of cache memory to store data in said set of cache memory.
 30. The multi-processor system of claim 29, wherein said streaming output engine is coupled to at least one application engine in said set of application engines to store data from said at least one application engine in said set of cache memory.
 31. The multi-processor system of claim 25, wherein said snoop controller is coupled to said sets of cache memory via a snoop ring for issuing snoop requests.
 32. The multi-processor system of claim 31, wherein said snoop controller is coupled to said sets of cache memory via point-to-point links for receiving memory requests.
 33. The multi-processor system of claim 32, wherein processing clusters in said set of processing clusters are coupled together via a data ring.
 34. The multi-processor system of claim 33, wherein said data ring is coupled to said sets of cache memory.
 35. The multi-processor system of claim 34, wherein said sets of cache memory each include a set of first tier caches coupled to a second tier cache coupled to said snoop controller and said data ring.
 36. The multi-processor system of claim 25, wherein said set of processing clusters and said snoop controller are formed together on a single integrated circuit.
 37. A computer system including: a set of cache memory; a memory management unit; and a set of compute engines coupled to said memory management unit and said set of cache memory, wherein each compute engine includes a central processing unit coupled to a coprocessor, wherein said set of compute engines includes at least 2 compute engines.
 38. The computer system of claim 37, wherein each compute engine in said set of compute engines is coupled to said memory management unit.
 39. The computer system of claim 38, wherein each coprocessor in said set of compute engines is coupled to said memory management unit.
 40. A processing cluster for use in a system with a main memory, said processing cluster comprising: a set of cache memory adapted for coupling to said main memory, wherein said set of cache memory includes: a set of first tier cache memories, and a second tier cache memory coupled to said set of first tier cache memories and adapted for coupling to said main memory; a set of compute engines, wherein each compute engine in said set of compute engines is coupled to said set of cache memory and includes: a central processing unit coupled to cache memory in said set of first tier cache memories, and a coprocessor coupled to said central processing unit and a cache memory in said set of first tier cache memories; and a memory management unit coupled to at least 2 compute engines in said set of compute engines.
 41. A multi-processor system comprising: a main memory; a set of processing clusters, wherein each processing cluster in said set of processing clusters includes: a set of cache memory coupled to said main memory, said set of cache memory including: a set of first tier cache memories, and a second tier cache memory coupled to said set of first tier cache memories; a set of compute engines, wherein each compute engine in said set of compute engines includes: a central processing unit coupled to a cache memory in said set of first tier cache memories, and a coprocessor coupled to said central processing unit and a cache memory in said set of first tier cache memories; and a snoop controller coupled to said second tier cache memory for receiving and placing memory requests for transferring data between said processing clusters and said main memory.
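
By way of illustration only, the rotating-basis address translation recited in claim 16 can be modeled in software. The following is a minimal sketch, not the claimed hardware: the class name `SharedMMU`, the per-engine request queues, the 4096-byte page size, and the dictionary page table are all assumptions introduced for the example, and "rotating basis" is interpreted here as simple round-robin service among the compute engines.

```python
from collections import deque

PAGE_SIZE = 4096  # assumed page size; the claims do not specify one


class SharedMMU:
    """Hypothetical model of one memory management unit shared by
    several compute engines and serving them in round-robin order."""

    def __init__(self, num_engines, page_table):
        # page_table maps a virtual page number to a physical page number.
        self.page_table = page_table
        # One pending-request queue per compute engine (an assumption).
        self.queues = [deque() for _ in range(num_engines)]
        # Index of the engine that has priority on the next step.
        self.next_engine = 0

    def request(self, engine_id, virtual_addr):
        """Queue a translation request from the given compute engine."""
        self.queues[engine_id].append(virtual_addr)

    def step(self):
        """Serve at most one pending request, rotating across engines.

        Returns (engine_id, physical_addr), or None if nothing is pending.
        """
        n = len(self.queues)
        for offset in range(n):
            engine_id = (self.next_engine + offset) % n
            if self.queues[engine_id]:
                vaddr = self.queues[engine_id].popleft()
                page, off = divmod(vaddr, PAGE_SIZE)
                paddr = self.page_table[page] * PAGE_SIZE + off
                # Priority moves to the engine after the one just served.
                self.next_engine = (engine_id + 1) % n
                return engine_id, paddr
        return None
```

In this sketch, an engine with several queued requests cannot starve the others, because priority passes to the next engine after every translation served; an idle engine is simply skipped.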